Abstract. Standard tests of the “no-treatment-effect” hypothesis for a 
comparative experiment include permutation tests, the Wilcoxon rank 
sum test, two-sample t tests, and Fisher-type randomization tests. 
Practitioners are aware that these procedures test different no-effect 
hypotheses and are based on different modeling assumptions. However,
this awareness is not always, or even usually, accompanied by
a clear understanding or appreciation of these differences. Borrowing 
from the rich literatures on causality and finite-population sampling 
theory, this paper develops a modeling framework that affords answers 
to several important questions, including: exactly what hypothesis is 
being tested, what model assumptions are being made, and are there 
other, perhaps better, approaches to testing a no-effect hypothesis? The 
framework lends itself to clear descriptions of three main inference
approaches: process-based, randomization-based, and selection-based. It
also promotes careful consideration of model assumptions and targets 
of inference, and highlights the importance of randomization. Along 
the way, Fisher-type randomization tests are compared to permutation 
tests and a less well-known Neyman-type randomization test. A simulation
study compares the operating characteristics of the Neyman-type
randomization test to those of the other more familiar tests. 
1. INTRODUCTION 

We begin with a simple example of a randomized comparative experiment.
Researchers are interested in determining whether cell phone use while
driving has an impact on reaction times. Toward this end, 64 University
of Utah student volunteers were enlisted to take part in a randomized
comparative experiment (Strayer and Johnston, 2001). Of the 64 students,
32 were randomized to treatment 1 (operate a driving simulator while
using a cell phone) and 32 were randomized to treatment 2 (operate a
driving simulator without a cell phone). For a summary description of
the data and of the way the two treatments were actually administered,
see Agresti and Franklin (2007, page 446). In the driving simulation,
each student encountered several red lights at random times. Each
student’s response was the average time required to stop when a red
light was detected. The 64 responses, in milliseconds, are recorded in
Table 1.

Is there a cell phone use effect? Generically, is
there a treatment effect?

Standard tests of the “no-treatment-effect” hypothesis include
permutation tests (Pitman, 1937, 1938), the Wilcoxon rank sum test
(Wilcoxon, 1945), two-sample t tests (cf. Welch, 1938), and Fisher-type
randomization tests (Eden and Yates, 1933; Fisher, 1935; see also the
history in David, 2008). Most practitioners are aware that these
procedures test different “no-effect” hypotheses and are based on
different modeling assumptions. However, this awareness is not always,
or even usually, accompanied by a clear understanding or appreciation of
these differences. This paper looks at each of these testing approaches
and addresses two all-important questions: exactly what hypothesis is
being tested, and what model assumptions are being made? Along the way,
we will have to confront several other questions: how is the definition
of treatment effect operationalized, what is the actual target of
inference, what is the role of randomization, and are there other,
perhaps better, approaches to testing a no-effect hypothesis?

To address these questions, we draw on ideas from the rich literature on
causal analysis. In particular, we employ the useful concept of
“potential variables.” Although the idea of potential variables can be
traced back to Neyman (1923), Rubin, beginning with a series of papers
on causal models in the 1970s (see Rubin, 2010, and references therein),
is usually credited with more explicitly stating the potential variable
model and extending it to both randomized and nonrandomized design
settings, with or without covariates (see Rubin’s causal model, Holland,
1986). Between Neyman and Rubin, potential variables were used by
relatively few authors; Welch (1937), Kempthorne (1952, 1955), and Cox
(1958a), Section 5, were among the notable early proponents. Around the
time of and after Rubin, many more authors made important contributions
to the potential variables literature. See, for example, Copas (1973),
Bailey (1981), Holland (1986), Greenland (1991, 2000), Gadbury (2001),
and the references therein.

To be clear, it is not the goal of this paper to summarize the vast
literature on potential variables and causal modeling. (To this end, see
Paul R. Rosenbaum’s very informative website and references therein,
www-stat.wharton.upenn.edu/~rosenbap/downloadTalks.htm.) Instead, the
first goal is to exploit the benefits of hindsight to develop a modeling
framework that supports clear descriptions and comparisons of the
different testing approaches, and promotes careful consideration of the
model assumptions and targets of inference. This modeling framework and
associated notation draw clear distinctions between realizations and
random variables, and between observed and unobserved data. The
framework accommodates both treatment assignment and sampling from
populations, and clearly differentiates between the two. Although the
proposed model lends itself to generalizations in many directions (e.g.,
more than two treatments, restricted randomization, etc.), to simplify
exposition, we will focus on the two-treatment comparative experiment
setting. This restriction allows us to more directly highlight the
useful features of the proposed modeling framework.

The second goal of this paper is to address the question of availability
of other testing approaches, besides the four common ones mentioned
above. Toward this end, we revisit ideas introduced in Neyman (1923).
Using the model structure introduced herein, we describe a less
well-known Neyman-type randomization test, which is qualitatively
different from the Fisher-type randomization test (cf. Welch, 1937;
Rubin, 1990, 2004, 2010). (Readers with an interest in history are
encouraged to read Neyman et al., 1935, along with the discussions, to
see how Neyman and Fisher publicly aired their differences of opinion on
testing in randomized design settings.) The Neyman randomization test,
which uses a less restrictive “no-effect” hypothesis than Fisher’s, is
based on a test statistic with the common form, (estimator minus
estimand)/(standard error of estimator). Neyman, with an eye on interval
estimation rather than testing, derived the standard error with respect
to a randomization distribution using tools from finite-population
sampling theory. In retrospect, Neyman’s derivation approach is hardly
surprising given that he “may be said to have initiated the modern
theory of survey sampling” (Lehmann, 1994) in his landmark paper of
Table 1

Reaction times (milliseconds)

Cell phone:   636 623 615 672 601 600 542 554 543 520 609 559 595 565 573 554
              626 501 574 468 578 560 525 647 456 688 679 960 558 482 527 536

Control:      557 572 457 489 532 506 648 485 610 444 626 626 426 585 487 436
              642 476 586 565 617 528 578 472 485 539 523 479 535 603 512 449

Generically:
Treatment 1:  $y_{1,1}, y_{1,2}, \ldots, y_{1,32}$
Treatment 2:  $y_{2,1}, y_{2,2}, \ldots, y_{2,32}$


1934 (Neyman, 1934). Compared to Fisher randomization tests, the Neyman
tests have both advantages and disadvantages. One disadvantage is that
Neyman tests are approximate, whereas Fisher tests are exact. An
advantage is that the Neyman test can be more powerful than the Fisher
test (see Section 7 below). Another advantage is that, unlike the Fisher
randomization test, the Neyman version can be used to test hypotheses
about a population when units are randomly sampled from the population
and then randomized to treatment levels.

The third and final goal of this paper is to compare the operating
characteristics of the five tests: the permutation test, the Wilcoxon
rank sum test, the two-sample t test, the Fisher randomization test, and
the Neyman randomization test. The penultimate section of the paper
includes a small-scale simulation study of the size and power of these
five tests. Based on these comparisons, we make tentative
recommendations on which test to use in different settings.

The remainder of this paper is organized as follows: Section 2
introduces potential variables and recasts the data in Table 1 within
this framework. Section 3 introduces a sequential data generation model
that explicitly accommodates both random sampling and randomization. The
components in the three-level sequential model are identified as the
“process,” the “sampling,” and the “randomization.” This model, along
with a useful component-selection notation, leads to an explicit
identification of the observed data and the three main targets of
inference. Section 4 gives candidate definitions of treatment effects
that are based on potential variables; corresponding no-treatment-effect
hypotheses are also given. An overview of the three main inference
approaches (process-based, selection-based, and randomization-based) is
given in Section 5. Section 6 introduces a difference statistic that can
be used as the basis for tests of the no-treatment-effect hypotheses.
Tests corresponding to each of the three inference approaches, along
with assumptions for their validity, are described in detail; some of
these tests are well known and some are less well known. Section 7
carries out an analysis of the cell phone data and includes a
small-scale simulation study of the operating characteristics of the
different testing approaches discussed herein. Finally, Section 8
includes a brief discussion.

2. WHAT MIGHT HAVE BEEN: THE 
POTENTIAL VARIABLES VIEWPOINT 

Going back to Neyman (1923) and following the 
lead of Welch (1937), Kempthorne (e.g., 1955, 1977), 
Cox (1958a), and Rubin (e.g., 2005), we will view 
the data as observed values of a sample of “potential 
values.” 

Consider a population of $N$ units that are, without loss of generality,
identified by the numbers 1 through $N$; in symbols, we will let
$P = (1, \ldots, N)$ represent the unit identifiers for the population.
For convenience, we will also refer to $P$ as the population of units.
Let $Y_{t.i}$ be the response for unit $i$ when exposed to treatment
$t$, where $i = 1, \ldots, N$ and $t = 1, 2$. The response variables
$Y_{1.i}$ and $Y_{2.i}$ are called potential variables for reasons made
clear in the next paragraph.

The introduction of these potential variables leads to intuitively
appealing definitions of treatment effects that are based on
head-to-head comparisons of $Y_{1.i}$ and $Y_{2.i}$. There is a catch,
however. Although there is the potential to observe either $Y_{1.i}$ or
$Y_{2.i}$, unfortunately, it is not possible to observe both. Strictly
speaking, it is not possible to observe the values of both potential
variables because the same subject cannot be simultaneously exposed to
both treatments. To the potential variable advocates, this is the
“fundamental problem of causal analysis” (Holland, 1986). As an example,
if we observe the value of $Y_{2.i}$, then the value of $Y_{1.i}$, and
hence the difference $Y_{1.i} - Y_{2.i}$, cannot be observed. In this
case,
Table 2

Potential values notation

Treatment 1:  $\cancel{y_{1.s_1}}, \cancel{y_{1.s_2}}, y_{1.s_3}, \ldots, \cancel{y_{1.s_{63}}}, y_{1.s_{64}}$    Only the 32 non-X’ed out values are observed.

Treatment 2:  $y_{2.s_1}, y_{2.s_2}, \cancel{y_{2.s_3}}, \ldots, y_{2.s_{63}}, \cancel{y_{2.s_{64}}}$    Only the 32 non-X’ed out values are observed.

Here, $s = (s_1, \ldots, s_{64})$ is a sample from some population $P = (1, \ldots, N)$, $N \geq 64$.


the unobserved value of $Y_{1.i}$ is relegated to counterfactual status;
the value is “what might have been” had unit $i$ been exposed to
treatment 1 rather than treatment 2.

The data in Table 1 can be viewed as observed values of a sample of the
potential variable values. Specifically, a sample $s$ of size
$n = n_1 + n_2 = 64$ is taken, without replacement, from the population
$P$. That is, $s = (s_1, \ldots, s_n)$, where $s_j \in P$ and
$s_j \neq s_{j'}$. One of the two treatments will be assigned to each of
the units in the sample $s$. For the example, treatment 1 was assigned
to $n_1 = 32$ and treatment 2 was assigned to $n_2 = 32$ of the 64
sampled units.

Let $y_{t.s_j}$ be the response value for sampled subject $s_j$ when
exposed to treatment $t$. That is, $y_{t.s_j}$ is a realization of
$Y_{t.s_j}$. Of course, for each subject $s_j$, only one of the
realizations, $y_{1.s_j}$ or $y_{2.s_j}$, will be observed. From a
potential variables viewpoint, the original data in Table 1 can be
viewed as in Table 2.

Remark. Table 1 used the conventional $y_{t,i}$, $i = 1, \ldots, 32$,
$t = 1, 2$ to represent the observed data, whereas Table 2 uses
$y_{2.s_1}, y_{2.s_2}, y_{1.s_3}, \ldots, y_{2.s_{63}}, y_{1.s_{64}}$.
It is important to note that the symbols $y_{t,i}$ and $y_{t.i}$
represent very different objects. For example, in Table 1, of those
units sampled and assigned treatment 1, the 3rd had a process value of
$y_{1,3}$, and of those units sampled and assigned treatment 2, the 3rd
had a process value of $y_{2,3}$. That is, $y_{1,3}$ and $y_{2,3}$ are
process values for two distinct units. In contrast, $y_{1.3}$ and
$y_{2.3}$ represent the process values under treatments 1 and 2 for the
same unit, specifically, the 3rd unit in the population.

3. DATA-GENERATION MODELS AND 
INFERENCE GOALS 

Let $Y = (Y_{1.1}, \ldots, Y_{1.N}, Y_{2.1}, \ldots, Y_{2.N})$ be the
vector of potential variables for the population $P$ and
$y = (y_{1.1}, y_{1.2}, \ldots, y_{1.N}, y_{2.1}, \ldots, y_{2.N})$ be
the corresponding vector of realizations. We will use this notational
convention throughout the paper: upper case letters for random variables
and lower case letters for realizations.


To simplify and to highlight vector component identification, we
introduce dot operations and a component-selection bracket “[ ]”
notation that is similar to the matrix syntax used in computer languages
such as R. Let $x$ and $w$ be $m$-dimensional vectors and let $k$ be a
scalar. Define

$$x.w = (x_1.w_1, \ldots, x_m.w_m) \quad\text{and}\quad k.x = (k.x_1, \ldots, k.x_m).$$

Consider an $m$-dimensional vector $x$ with components identified by
subscripts $a_1, \ldots, a_m$, that is, $x = (x_{a_1}, \ldots, x_{a_m})$.
Provided $b = (b_1, \ldots, b_q)$ has components
$b_i \in \{a_1, \ldots, a_m\}$, for each $i = 1, \ldots, q$, the vector
$x[b]$ is defined as
$x[b] = x[b_1, \ldots, b_q] = (x_{b_1}, \ldots, x_{b_q})$.

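As an example, $y = (y_{1.1}, \ldots, y_{1.N}, y_{2.1}, \ldots, y_{2.N})$
can be expressed as $y = y[1.P, 2.P]$. Similarly,
$y[1.s] = (y_{1.s_1}, \ldots, y_{1.s_n})$ and
$y[t.s] = (y_{t_1.s_1}, \ldots, y_{t_n.s_n})$. We will also use a
notation for averages. As examples,

$$\bar{Y}[t.P] = N^{-1}\sum_{i=1}^{N} Y[t.i], \qquad \bar{y}[t.P] = N^{-1}\sum_{i=1}^{N} y[t.i] \qquad\text{and}\qquad \bar{y}[t.s] = n^{-1}\sum_{j=1}^{n} y[t.s_j].$$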
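The dot and bracket notation can be mimicked in a few lines of code. The following Python sketch is only an illustration, not part of the paper: the helper names `dot`, `select`, and `average`, the choice to store a label $t.i$ as the tuple `(t, i)`, and the toy values are all assumptions made here.

```python
# Illustrative sketch: a label "t.i" is stored as the tuple (t, i), and the
# vector y is a dict keyed by such labels. All names and values are hypothetical.

def dot(t, x):
    """Dot operation: k.x when t is a scalar, x.w (componentwise) when t is a sequence."""
    if isinstance(t, (list, tuple)):
        if len(t) != len(x):
            raise ValueError("t and x must have the same length")
        return [(tj, xj) for tj, xj in zip(t, x)]
    return [(t, xj) for xj in x]

def select(y, labels):
    """Bracket selection y[b]: the components of y in the order given by labels."""
    return [y[b] for b in labels]

def average(y, labels):
    """Average of the selected components, as in ybar[t.P] or ybar[t.s]."""
    values = select(y, labels)
    return sum(values) / len(values)

# Toy population of N = 4 units with made-up potential values.
P = [1, 2, 3, 4]
y = {(1, 1): 10.0, (1, 2): 12.0, (1, 3): 9.0, (1, 4): 11.0,
     (2, 1): 8.0, (2, 2): 12.0, (2, 3): 7.0, (2, 4): 10.0}

s = [3, 1]   # sampled units (s_1, s_2)
t = [2, 1]   # treatments assigned to s_1 and s_2

y_1s = select(y, dot(1, s))       # y[1.s] = (y[1.3], y[1.1])
y_ts = select(y, dot(t, s))       # y[t.s] = (y[2.3], y[1.1]), the observed data
ybar_1P = average(y, dot(1, P))   # ybar[1.P]
```

Storing labels as tuples keeps the treatment index and the unit identifier separate, which is the distinction the dot notation is designed to make.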

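The data-generation models we consider in this paper are based on the
following sequential generations:

$$
\begin{aligned}
& y \leftarrow Y, && \text{here } y = (y_{1.1}, \ldots, y_{1.N}, y_{2.1}, \ldots, y_{2.N}),\\
& s \leftarrow S \mid (Y = y), && \text{here } s = (s_1, \ldots, s_n),\ s_j \in P,\ s_j \neq s_{j'},\\
& t \leftarrow T \mid (Y = y, S = s), && \text{here } t = (t_1, \ldots, t_n),\ t_j \in \{1, 2\}.
\end{aligned}
\tag{1}
$$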
The left arrow “$\leftarrow$” is read, “is a realization of.” Here, $Y$
is the collection of $2N$ potential response variables, $S$ is the
collection of $n$ sampling variables, and $T$ is the collection of $n$
treatment assignment variables. The sequencing in (1) is not required to
correspond to the temporal sequencing of data generation. It is meant
only as a device for specifying the joint distribution of $(Y, S, T)$.
For a related discussion, see Rubin [2010, between equations (4) and
(5)].

In words, “Nature” generates $N$ units, which are labeled
$1, 2, \ldots, N$. Each unit can potentially experience either of two
“possible worlds,” which correspond to exposure under the two
treatments. The vector $y$ contains the $2N$ potential response values,
one for each of the $N$ units under treatment 1 and one for each of the
$N$ units under treatment 2. These potential deviates in $y$ are viewed
as realized, at least in theory, but only partially observable. We
sample $n$ distinct subjects $s$ from the population $P$. The sampling
may depend on potential deviates $y$; this dependence often stems from
selecting on covariates that are statistically related to the potential
variables [see Rubin, 2010, between equations (4) and (5)]. Finally, we
assign treatment levels $t$ to units in the sample, that is, we choose
which of the two possible worlds we will observe for each unit in the
sample. The treatment assignment may depend on the potential deviates
$y$ and/or the sampled units $s$. However, when mechanical or physical
randomization (cf. Fisher, 1935; Kempthorne, 1955) is used, the
treatment assignment can be made to be independent of the potential
deviates.
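As a rough illustration of the three-level generation in (1), the Python sketch below draws process values, then a sample, then a treatment assignment. It is a sketch under stated assumptions rather than the paper's simulation design: the function name `generate`, the normal process distributions and their parameters, simple random sampling without replacement, and a completely randomized assignment are all choices made here for illustration. In this sketch the selection happens to be independent of the process, which model (1) does not require.

```python
import random

def generate(N, n, n1, seed=0):
    """One pass through the process -> sampling -> randomization sequence of (1)."""
    rng = random.Random(seed)
    P = list(range(1, N + 1))

    # Level 1 (process): realize the 2N potential values y[t.i].
    # The normal distributions below are illustrative assumptions only.
    y = {}
    for i in P:
        y[(1, i)] = rng.gauss(585.0, 90.0)
        y[(2, i)] = rng.gauss(535.0, 65.0)

    # Level 2 (sampling): n distinct units, here by simple random sampling.
    s = rng.sample(P, n)

    # Level 3 (randomization): n1 units get treatment 1, the rest treatment 2.
    t = [1] * n1 + [2] * (n - n1)
    rng.shuffle(t)

    # Only y[t.s] is observed: one potential value per sampled unit.
    observed = [y[(tj, sj)] for tj, sj in zip(t, s)]
    return y, s, t, observed

y, s, t, observed = generate(N=200, n=64, n1=32, seed=1)
```

Of the 400 generated potential values, only the 64 entries of `observed` would be available to the analyst; the other 336 remain unobserved.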

We will refer to the potential variables $Y$ as “process variables”¹ and
the values $y$ as “process values,” to differentiate them from the
“selection” variables $(S, T)$ and values $(s, t)$. The process portion
describes how things behave under both possible worlds and the selection
portion determines how we go about observing this behavior. Owing to the
sampling and treatment assignment (the selection), we do not observe the
entire vector of potential deviates $y$ (the process values). Indeed,
the “fundamental problem of causal inference” rules out the possibility
of fully observing the $2N$-dimensional data vector $y$. Instead, we
observe only the $n$-dimensional sub-vector $y[t.s]$. Schematically, we
have

$$\underbrace{y[t.s]}_{\text{observed}} \subset \underbrace{y[1.s, 2.s] \subset y[1.P, 2.P] = y}_{\text{unobserved}} \leftarrow Y.$$

¹Rubin (2005) uses the descriptor “science” rather than “process.”


The inference goal of this paper can be stated succinctly as follows:

Inference goal. Use the observed data $y[t.s]$ from a comparative
experiment to reduce uncertainty about one of the three targets: the
vector $y[1.s, 2.s]$, the vector $y[1.P, 2.P]$, or the distribution of
$Y$.

4. TREATMENT EFFECTS AND 
“NO-TREATMENT-EFFECT” HYPOTHESES 

4.1 Treatment Effects 

We began this paper with the question of whether there was a treatment
effect. Of course, this raises another question: What exactly is a
“treatment effect”?

In a comparative experiment, a treatment effect can be viewed as some
measure of the difference between the response ($Y$) distribution or
response values ($y$) for treatment level 1 and the response
distribution or response values for treatment level 2. The potential
variables viewpoint lends itself to intuitively appealing candidate
definitions of such treatment effects (cf. Neyman, 1923; Rubin, 1990,
2005, 2010). Some of the candidates considered in this paper are as
follows:

Realized unit-specific effects:
$y[1.s_j] - y[2.s_j]$, $j = 1, \ldots, n$ or
$y[1.i] - y[2.i]$, $i = 1, \ldots, N$.

Distribution unit-specific effects:
$\delta(F_{1.i}, F_{2.i})$, $i = 1, \ldots, N$.

Expected unit-specific effects:
$E(Y[1.i]) - E(Y[2.i])$, $i = 1, \ldots, N$.

Realized aggregate effects:
$\bar{y}[1.s] - \bar{y}[2.s]$ or
$\bar{y}[1.P] - \bar{y}[2.P]$.

Expected aggregate effects:
$E(\bar{Y}[1.P]) - E(\bar{Y}[2.P])$.

For example, the realized unit-specific treatment effect
$y[1.s_j] - y[2.s_j]$ is simply the difference between unit $s_j$’s
responses under two scenarios or two possible worlds: in one world the
unit is exposed to treatment 1 and in the other world the unit is
exposed to treatment 2. As another example, the distribution
unit-specific effect $\delta(F_{1.i}, F_{2.i})$ measures the distance
between the c.d.f.’s of $Y[1.i]$ and $Y[2.i]$ using some distance
function $\delta(\cdot)$. This latter example illustrates that treatment
effects need not be defined in terms of simple differences, arithmetic
averages, or means of distributions. Other examples of treatment effects
include the distribution unit-specific effect
$\mathrm{median}(Y[1.i]) - \mathrm{median}(Y[2.i])$, realized
unit-specific effects, such as $(y[2.s_j] - y[1.s_j])/y[1.s_j]$, and
realized aggregate effects, such as $\|y[1.s] - y[2.s]\|$ or
$\mathrm{var}(y[1.s]) - \mathrm{var}(y[2.s])$ or, for binary responses,
the realized odds ratio

$$\text{ratio} = \frac{\bar{y}[1.s]\,(1 - \bar{y}[2.s])}{\bar{y}[2.s]\,(1 - \bar{y}[1.s])},$$

etc.


Unfortunately, none of the treatment effects mentioned above is
observable. The expected and distribution effects cannot be observed
because the distribution of $Y$ is not completely known. The realized
effects cannot be observed because, by the fundamental problem of causal
inference, only one of the realizations, for example, either $y[1.s_j]$
or $y[2.s_j]$, can be observed. Fortunately, this does not preclude
unbiased estimation of some of these unobservable treatment effects, as
we point out below.
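A small numerical illustration may help fix ideas. The Python sketch below uses an entirely hypothetical sample of four units whose potential values under both treatments are pretended to be known, something that is never possible with real data. It computes the realized unit-specific and realized aggregate effects, and then shows what an analyst would actually see after treatment assignment.

```python
# Hypothetical potential values (y[1.s_j], y[2.s_j]) for n = 4 sampled units.
# With real data only one value per unit is ever available.
potential = {
    "s1": (600.0, 550.0),
    "s2": (520.0, 530.0),
    "s3": (640.0, 600.0),
    "s4": (580.0, 580.0),
}

# Hypothetical treatment assignment t_j for each sampled unit.
assignment = {"s1": 1, "s2": 2, "s3": 2, "s4": 1}

# Realized unit-specific effects y[1.s_j] - y[2.s_j].
unit_effects = {u: y1 - y2 for u, (y1, y2) in potential.items()}

# Realized aggregate effect ybar[1.s] - ybar[2.s].
n = len(potential)
ybar_1s = sum(y1 for y1, _ in potential.values()) / n
ybar_2s = sum(y2 for _, y2 in potential.values()) / n
aggregate_effect = ybar_1s - ybar_2s

# Observed data y[t.s]: one potential value per unit, chosen by the assignment.
observed = {u: potential[u][assignment[u] - 1] for u in potential}
```

Neither `unit_effects` nor `aggregate_effect` can be recovered from `observed` alone, because each unit contributes only one of its two potential values.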

In the potential-variables causal literature, the treatment effects
defined above would be considered causal effects provided certain
assumptions hold (e.g., Rubin, 1990, 2005, 2010). To avoid the ongoing
debate about the nature of causality, we will refrain from referring to
treatment effects as causal effects.


4.2 “No-Treatment-Effect” Hypotheses

Corresponding to each treatment effect definition, there is a
“no-treatment-effect” hypothesis. As examples,

$H^{RU(P)}_{\mathrm{w.p.1}}$: $Y[1.P] = Y[2.P]$, with probability 1;

    $H^{DU(P)}$: $Y[1.i] \sim Y[2.i]$, $i = 1, \ldots, N$.
    Herein, “$\sim$” means “distributed as”;

        $H^{EU(P)}$: $E(Y[1.i]) = E(Y[2.i])$, $i = 1, \ldots, N$;

    $H^{RU(P)}$: $y[1.P] = y[2.P]$;

        $H^{RA(P)}$: $\bar{y}[1.P] = \bar{y}[2.P]$;

        $H^{RU(s)}$: $y[1.s] = y[2.s]$;

            $H^{RA(s)}$: $\bar{y}[1.s] = \bar{y}[2.s]$.

The indentations are used to denote nesting. For example, both
$H^{DU(P)}$ and $H^{RU(P)}$ are implied by $H^{RU(P)}_{\mathrm{w.p.1}}$.
Similarly, $H^{RA(s)}$ is implied by $H^{RU(s)}$ and by $H^{RU(P)}$, but
not by $H^{RA(P)}$. The superscripts remind us of the type of treatment
effect used in the hypothesis. For example, the hypothesis $H^{EU(P)}$
uses Expected Unit-specific effects (for the Population), and
$H^{RA(s)}$ uses Realized Aggregate (over sample $s$) effects.


5. INFERENCE APPROACHES AND
ASSUMPTIONS

The $(y, s, t)$ components in the observed data $y[t.s]$ are viewed as
outcomes of the sequential generations of (1). The complete, but only
partially observed, data $y$ is a realization of the $2N$-dimensional
vector of potential variables $Y$. In symbols, we have
$y[t.s] \leftarrow Y[T.S]$ and $y \leftarrow Y$.

As stated previously, the inference goal is to use the observed data
$y[t.s]$ to reduce uncertainty about one of three targets: the
distribution of $Y$, the vector $y[1.P, 2.P]$, or the vector
$y[1.s, 2.s]$. The choice of inference approach depends on which of
these targets we are interested in and on what assumptions we can
reasonably make about the joint distribution of $(Y, S, T)$, where $Y$
is the “process” variable and $(S, T)$ are the “selection” variables.
More specifically, $S$ is the “sampling” variable and $T$ is the
“randomization” or treatment assignment variable. In this paper, we
consider three candidate inference approaches.

5.1 Process-Based Inference and Assumptions


With the process-based approach, we condition on the selection (only $Y$
is random) and use

$$y[t.s] \leftarrow Y[T.S] \mid (S = s, T = t) \sim Y[t.s] \mid (S = s, T = t)$$

to carry out inferences about the distribution of $Y$. (The discussion
section describes more general inferences.)

With process-based inference, we generally must make assumptions about
the conditional distribution of $Y \mid (S = s, T = t)$. However,
because this paper focuses on test procedures that are valid provided
the process is independent of the selection (see assumption $A_1$), we
will only make assumptions about the (unconditional) process
distribution of $Y$; see assumptions $A_2$-$A_7$.

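$A_1$: $(S, T) \perp\!\!\!\perp Y$;

$A_2$: $Y[t.i] \sim F_t$, $i = 1, \ldots, N$, $t = 1, 2$;

$A_3$: $Y[1.i, 2.i]$, $i = 1, \ldots, N$, are independent;

$A_4$: $F_t \in \{\text{continuous c.d.f.s}\}$;

$A_5$: $F_t \in \{\ldots\ \text{c.d.f.s}\}$;

$A_6$: $F_t \in \{N(\mu_t, \sigma^2)\ \text{c.d.f.s}\}$;

$A_7$: $F_t \in \{\text{c.d.f.s with mean and variance } (\mu_t, \sigma_t^2)\}$.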
Process-based inference is simplified under assumption $A_1$ because the
observed data can be viewed as a realization of $Y[t.s]$; in symbols,
$y[t.s] \leftarrow Y[t.s]$. It follows that we need only model the
(unconditional) distribution of the process variable $Y$. Importantly,
under $A_1$, process-based inference does not require any assumptions
about the selection $(S, T)$. It is also important to note that when
$A_1$ holds and $Y$ is modeled via assumptions such as $A_2$-$A_7$, the
observed data $y[t.s]$ can be used to make process-based inferences
about the distribution of $Y$.

Unfortunately, assumption $A_1$ is typically not tenable in practice.
Notice that $A_1$ is equivalent to the two assumptions,
$T \perp\!\!\!\perp Y \mid S$ and $S \perp\!\!\!\perp Y$. When
mechanical randomization is used to assign treatments to the sampled
units, the first assumption can be made tenable. However, the
reasonableness of the second assumption, that the sampling variable $S$
is independent of $Y$, is often questionable in practice. For example,
with haphazard or convenience sampling, rather than probability
sampling, it often turns out that $S$ and $Y$ are not independent. The
dependence typically stems from sampling on the basis of covariates that
are related to $Y$.²

The assumption $A_2$ is not as restrictive as it may initially appear.
For example, whenever the identifiers are arbitrarily assigned to the
$N$ population units, the $N$ pairs $Y[1.i, 2.i]$ would be exchangeable
and, hence, $A_2$ would hold. Generally, the more tenuous assumptions
are $A_1$, that the selection is carried out independently of the
process, the independence assumption $A_3$, and the assumptions
$A_4$-$A_7$ about the form of marginal distributions $F_1$ and $F_2$.

5.2 Randomization-Based Inference and 
Assumptions 

With the randomization-based approach, we condition on both the process
values and the sample (only $T$ is random) and use

$$y[t.s] \leftarrow Y[T.S] \mid (Y = y, S = s) \sim y[T.s] \mid (Y = y, S = s)$$

to carry out inferences about $y[1.s, 2.s]$.

²Of course, if the covariates responsible for the dependence were known
and observable, we could condition on their values to restore
independence; however, this conditional model falls outside the purview
of the current paper.

With randomization-based inference, we generally must make assumptions
about the conditional distribution of $T \mid (Y = y, S = s)$. However,
because this paper focuses on test procedures that are valid provided
the randomization is conditionally independent of the process, given the
sample (see assumption $B_1$), we will only make assumptions about the
distribution of $T \mid (S = s)$; see assumption $B_2$.

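$B_1$: $T \perp\!\!\!\perp Y \mid S$.

$B_2$: The distribution of $T \mid (S = s)$ is completely known and
satisfies

$$P(T.s \ni t.s_j \mid S = s) > 0, \qquad j = 1, \ldots, n,\ t = 1, 2;$$

$$P(T.s \ni t.s_j,\ T.s \ni t'.s_{j'} \mid S = s) = 0 \quad\text{if and only if } t \neq t' \text{ and } j = j'.$$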

Randomization-based inference is simplified under assumption $B_1$ and
fortunately the use of mechanical randomization makes this assumption
tenable. Under $B_1$, we have that the observed data can be viewed as
$y[t.s] \leftarrow y[T.s] \mid (S = s)$, so only the distribution of
$T \mid (S = s)$ needs to be modeled. In particular, we need not make
any assumption about the distribution of $Y$ or its relation to $S$; for
example, $Y$ and $S$, that is, the process and the sampling, need not be
independent. It is important to note that when the distribution of
$T \mid (S = s)$ is completely known (see $B_2$), the distribution of
$y[T.s] \mid (S = s)$ is known up to the partially observed values
$y[1.s, 2.s]$, which are the parameters of interest for
randomization-based inference. Thus, when $B_1$ and $B_2$ hold, the
observed data $y[t.s]$ can be used to carry out randomization-based
inference about the target parameters $y[1.s, 2.s]$.

In $B_2$, the probabilities are called first- and second-order inclusion
probabilities for the random sample, namely, $T.s \mid (S = s)$, taken
from $(1.s, 2.s)$. Assumption $B_2$ imposes constraints on these
inclusion probabilities. The positive first-order inclusion
probabilities imply that “proper” randomization is used to assign
treatments, that is, each unit in the sample has a positive probability
of receiving either treatment; we say that this is a “proper” randomized
comparative experiment.³ Put another way, $T.s \mid (S = s)$ is a
probability sample from $(1.s, 2.s)$.


³The two-treatment completely randomized design (CRD) experiment is a
special-case example of a proper randomized comparative experiment. With
the CRD, $T \mid (S = s)$ has a uniform distribution over all possible
rearrangements of $n_1$ 1’s and $n_2$ 2’s (cf. Cox, 1958b, pages 71-72;
Kempthorne, 1977, Section 8; or Cox and Reid, 2000, Section 2.2.4).
Because the same unit cannot be assigned different treatments, the
second-order inclusion probabilities with $t \neq t'$ and $j = j'$ are
0. This implies that the probability sample is nonmeasurable, to use
language from sampling theory (cf. Särndal et al., 1992, pages 32-33).
This nonmeasurability complicates the computation of certain
randomization-based test statistics (see Section 6.3.2 below), as Neyman
was fully aware of in 1923.
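For the completely randomized design, the inclusion probabilities in $B_2$ can be checked by brute-force enumeration when $n$ is small. The Python sketch below is an illustration only; the function names and the choice $n = 6$, $n_1 = 3$ are assumptions made here. It lists all equally likely assignments and computes first- and second-order inclusion probabilities as exact fractions.

```python
from fractions import Fraction
from itertools import combinations

def crd_assignments(n, n1):
    """All equally likely CRD assignment vectors with n1 ones and n - n1 twos."""
    assignments = []
    for ones in combinations(range(n), n1):
        chosen = set(ones)
        assignments.append(tuple(1 if j in chosen else 2 for j in range(n)))
    return assignments

def first_order(assignments, t, j):
    """P(T.s contains t.s_j | S = s) under the uniform CRD distribution."""
    hits = sum(1 for a in assignments if a[j] == t)
    return Fraction(hits, len(assignments))

def second_order(assignments, t, j, t2, j2):
    """P(T.s contains both t.s_j and t2.s_j2 | S = s)."""
    hits = sum(1 for a in assignments if a[j] == t and a[j2] == t2)
    return Fraction(hits, len(assignments))

A = crd_assignments(n=6, n1=3)

p_first = first_order(A, 1, 0)               # equals n1 / n for every unit
p_same_unit = second_order(A, 1, 0, 2, 0)    # same unit, different treatments
p_two_units = second_order(A, 1, 0, 1, 1)    # two distinct units, both treatment 1
```

In this enumeration the first-order probabilities are all positive, and the only zero second-order probabilities are those pairing one unit with two different treatments, which is the nonmeasurability noted above.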

5.3 Selection-Based Inference and Assumptions 

With the selection-based approach, we condition on the process values
[only $(S, T)$ is random] and use

$$y[t.s] \leftarrow Y[T.S] \mid (Y = y) \sim y[T.S] \mid (Y = y)$$

to carry out inferences about $y[1.P, 2.P]$.

With selection-based inference, we generally must make assumptions about
the conditional distribution of $(S, T) \mid (Y = y)$. However, because
this paper focuses on test procedures that are valid provided the
selection is independent of the process (see assumption $C_1$), we will
only make assumptions about the unconditional distribution of $(S, T)$;
see assumption $C_2$.

$C_1$: $(S, T) \perp\!\!\!\perp Y$.

$C_2$: The distribution of $(S, T)$ is completely known and satisfies

$$P(T.S \ni t.i) > 0, \qquad i = 1, \ldots, N,\ t = 1, 2;$$

$$P(T.S \ni t.i,\ T.S \ni t'.i') = 0 \quad\text{if and only if } t \neq t' \text{ and } i = i'.$$

Selection-based inference is simplified under assumption $C_1$ because
the observed data can be viewed as $y[t.s] \leftarrow y[T.S]$. It
follows that we need only specify the (unconditional) distribution of
the selection $(S, T)$; no assumptions about $Y$ are needed.
Unfortunately, as discussed in the process-based subsection above,
assumption $C_1$ is not usually tenable in practice because the sampling
and process are often dependent. It is important to note that when the
distribution of $(S, T)$ is completely known (see $C_2$), the
distribution of $y[T.S]$ is known up to the partially observed values
$y[1.P, 2.P]$, which are the parameters of interest for selection-based
inference. Thus, when $C_1$ and $C_2$ hold, the observed data $y[t.s]$
can be used to carry out selection-based inference about the target
parameters $y[1.P, 2.P]$.

As discussed in the randomization-based section, assumption $C_2$
imposes constraints on first- and second-order inclusion probabilities.
In this case, the random sample $T.S$ is taken from $(1.P, 2.P)$. The
assumption implies that each of the $2N$ elements in $(1.P, 2.P)$ has a
positive probability of being selected. Thus, the random sample is a
probability sample. The 0 second-order inclusion probabilities imply
that the probability sample is nonmeasurable.

6. TESTS OF THE NO-TREATMENT-EFFECT
HYPOTHESIS

This section describes a collection of process-, randomization-, and
selection-based tests of no-treatment-effect hypotheses. Some of these
tests are well known (e.g., the two-sample t test), and some are less
well known (e.g., the Neyman randomization test). In any case, we will
emphasize the assumptions needed for their applicability and we will
carefully state the hypothesis that is actually being tested. We begin
by introducing a difference statistic that forms the basis of most of
the tests considered in this paper.

6.1 The Difference Statistic


With the exception of the Wilcoxon rank sum statistic, this paper will
focus on test statistics that are based on the following difference
statistics:

$$
\begin{aligned}
D_1(Y[t.s]) &= D(Y, s, t, w_1) && \text{(process)},\\
D_3(T) &= D(y, s, T, w_3) && \text{(randomization)},\\
D_{23}(S, T) &= D(y, S, T, w_{23}) && \text{(selection)},
\end{aligned}
$$

where

$$
D(y, s, t, w) = \underbrace{\sum_{i=1}^{N} \frac{y[1.i]\,1(t.s \ni 1.i)}{w[1.i]}}_{\text{weighted avg of trt 1 values}} - \underbrace{\sum_{i=1}^{N} \frac{y[2.i]\,1(t.s \ni 2.i)}{w[2.i]}}_{\text{weighted avg of trt 2 values}}. \tag{2}
$$
The candidate values for weights $w$ include

$$w_1[t.i] = n_t, \qquad w_3[t.i] = n\,P(T.s \ni t.i \mid S = s), \qquad w_{23}[t.i] = N\,P(T.S \ni t.i),$$

where $n = \text{length}(s)$ and $n_t = \sum_{j=1}^{n} 1(t_j = t)$. From the
discussions in Sections 5.2 and 5.3, it follows that the
$w_3$ and $w_{23}$ components are multiples of first-order
inclusion probabilities (cf. Särndal et al., 1992), using
language from finite-population sampling theory.
By convention, we set 0/0 = 0 in (2).
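As an illustration of (2), the following is a minimal Python sketch (the paper's own computations were carried out in R). The dictionary layout, the function name `difference_statistic`, and the toy data are assumptions made for illustration only. The sketch evaluates D with the weights $w_1$ and, assuming a completely randomized design so that $P(T.s \ni t.i \mid S = s) = n_t/n$, with the weights $w_3$.

```python
from typing import Dict, Tuple

Key = Tuple[int, int]  # (treatment t, unit i), i.e. the index "t.i"


def difference_statistic(y_obs: Dict[Key, float], w: Dict[Key, float]) -> float:
    """D(y, s, t, w) from (2): weighted trt 1 sum minus weighted trt 2 sum.

    y_obs holds only the observed values y[t.s]; the indicator in (2) is
    implicit because unobserved (t, i) pairs are simply absent.
    """
    total = 0.0
    for (t, i), value in y_obs.items():
        sign = 1.0 if t == 1 else -1.0
        total += sign * value / w[(t, i)]
    return total


# Hypothetical toy data: four sampled units, two per treatment.
s = [1, 2, 3, 4]
t = [1, 2, 1, 2]
y = [5.0, 3.0, 7.0, 4.0]
y_obs = {(tj, sj): yj for tj, sj, yj in zip(t, s, y)}

n = len(s)
n_t = {1: t.count(1), 2: t.count(2)}

# w_1[t.i] = n_t (unweighted sample averages).
w1 = {key: n_t[key[0]] for key in y_obs}
# w_3[t.i] = n * P(T.s contains t.i | S = s); assumed uniform randomization.
w3 = {key: n * (n_t[key[0]] / n) for key in y_obs}

d1 = difference_statistic(y_obs, w1)
d3 = difference_statistic(y_obs, w3)
print(d1, d3)
```

Under the assumed uniform randomization the two weight choices coincide, so the two printed values agree; other randomization schemes would generally give different $w_3$ weights.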

There are several useful properties of these $D$
statistics. First, note that

$D(y, s, t, w)$ can be computed using only
the observed values $y[t.s]$, $s$, and $t$.

That is, $D$, and hence each of $D_1$, $D_3$, and $D_{23}$, is an
observable statistic. It also follows that the process-based
statistic $D_1$ depends on $Y$ only through $Y[t.s]$,
hence the notation $D_1(Y[t.s])$. Second, the process-based
statistic $D_1$ is simply the difference between
the unweighted sample averages of the observed
treatment 1 and treatment 2 values. The randomization-
and the selection-based statistics $D_3$ and $D_{23}$ are
differences between probability-weighted sample
averages. Third,

Under $A_1$, $D_1 \mid (S = s, T = t)$
has distribution that depends only on the model
for $Y[t.s]$.

Under $B_1$, $D_3 \mid (Y = y, S = s)$
has distribution that depends only on the
$y[1.s, 2.s]$ values and the $T \mid (S = s)$ distribution. (3)

Under $C_1$, $D_{23} \mid (Y = y)$
has distribution that depends only on the
$y[1.P, 2.P]$ values and the $(S,T)$ distribution.

Fourth,

Under $A_1$ and $A_2$,

$$E(D_1 \mid S = s, T = t) = E(D_1) = \mu_1 - \mu_2.$$

Under $B_1$ and $B_2$,

$$E(D_3 \mid Y = y, S = s) = E(D_3 \mid S = s) = \bar{y}[1.s] - \bar{y}[2.s]. \tag{4}$$

Under $C_1$ and $C_2$,

$$E(D_{23} \mid Y = y) = E(D_{23}) = \bar{y}[1.P] - \bar{y}[2.P].$$

Here, $\mu_t = E(Y[t.i])$ is the mean of the assumed
common distribution $F_t$. The last two expectation
results follow because $D_3$ and $D_{23}$ are Horvitz-Thompson
probability-weighted estimators
(Horvitz and Thompson, 1952; Särndal et al., 1992,
page 43). These expectation results highlight the
usefulness of basing tests of “no treatment effects”
on these $D$ statistics, at least when the treatment
effect is measured in terms of differences in means.
These results also highlight the usefulness of random
sampling and treatment randomization.
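The randomization-based expectation in (4) can be checked numerically. The sketch below is an illustration only: it assumes a completely randomized design (so that $D_3$ reduces to a difference of sample averages), uses hypothetical potential values, and enumerates every possible assignment to confirm that the average of $D_3$ equals the difference between the two sample averages of potential values.

```python
from itertools import combinations
from statistics import mean


def d3_uniform(y1, y2, treated):
    """D_3 under complete randomization: the weights reduce to n_1 and n_2."""
    n = len(y1)
    control = [j for j in range(n) if j not in treated]
    return mean(y1[j] for j in treated) - mean(y2[j] for j in control)


# Hypothetical potential values for the n = 5 sampled units.
y1 = [5.0, 3.0, 7.0, 4.0, 6.0]  # y[1.s]: values under treatment 1
y2 = [4.0, 3.5, 5.0, 4.0, 8.0]  # y[2.s]: values under treatment 2
n, n1 = 5, 2

# All equally likely assignments of n1 units to treatment 1.
assignments = list(combinations(range(n), n1))
expected_d3 = mean(d3_uniform(y1, y2, set(a)) for a in assignments)
target = mean(y1) - mean(y2)
print(expected_d3, target)
```

In practice only one of the two potential values is observed per unit, so this check is possible only in a simulation where both columns are known.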

6.2 Process-Based Tests 

With the process-based approach, we condition on
the selection (only $Y$ is random) and use

$$y[t.s] \leftarrow Y[T.S] \mid (S = s, T = t) \sim Y[t.s] \mid (S = s, T = t)$$

to carry out inferences about the distribution of
$Y$. Among other assumptions, the validity of the
process-based tests described below generally requires
that assumptions $A_1$: $(S,T) \perp\!\!\!\perp Y$; $A_2$: $Y[t.i] \sim F_t$,
$i = 1, \ldots, N$, $t = 1, 2$; and $A_3$: $Y[1.i, 2.i]$, $i = 1, \ldots, N$,
are independent, hold. As noted in Section 5.1,
these assumptions are often untenable in practice,^
so the reader is reminded to apply these tests with
caution.

6.2.1 Permutation test. Consider the no-treatment-effect
hypothesis

$$Y[1.i] \sim Y[2.i], \quad i = 1, \ldots, N.$$

Under $H_0$ = {$A_1$, $A_2$, $A_3$, this hypothesis}, we can state
the null as $F_1 = F_2$ and base our test on

$$D_1 \mid (Y[t.s] \in \Pi(y[t.s])),$$

which has a known, computable distribution under $H_0$.

Here $\Pi(x)$ = {set of distinct permutations of $x$}.
That this distribution is known under $H_0$ follows
because in this case

$$Y[t.s] \mid (Y[t.s] \in \Pi(y[t.s])) \sim \text{uniform over points in } \Pi(y[t.s]). \tag{5}$$


^The tests are often invalid because the selection is related 
to the process, the treatment-specific process variables are not 
identically distributed, and/or the process variables are not 
independent across units. 
The computability follows because $D_1(x)$ can be
computed for any $x \in \Pi(y[t.s])$.

In practice, we would report a one- or two-sided
p-value. For example, letting $D_{1,\text{obs}} = D_1(y[t.s])$ be
the observed difference, a two-sided p-value can be
defined as

$$\text{pval}(D_{1,\text{obs}}) = P_{H_0}(|D_1| \ge |D_{1,\text{obs}}| \mid Y[t.s] \in \Pi(y[t.s])).$$

The size of the test that rejects $H_0$ iff pval < $\alpha$ is less
than or equal to $\alpha$. If we observe a p-value < $\alpha$ and
we assume that $A_1$-$A_3$ hold, then we have
statistical evidence against $F_1 = F_2$, that is,
evidence at the $\alpha$ level that $F_1 \ne F_2$.
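A minimal Python sketch of the two-sided permutation p-value based on $D_1$ follows. It is an illustration under stated assumptions, not the procedure used for the paper's results (those relied on Monte-Carlo estimation in R): the function name and toy data are hypothetical, and the sketch enumerates every split of the pooled values, which is feasible only for small $n$. Enumerating splits is equivalent to the uniform distribution over distinct permutations because every split corresponds to the same number of permutations.

```python
from itertools import combinations
from statistics import mean


def permutation_pvalue(y_trt1, y_trt2):
    """Exact two-sided permutation p-value for the difference in means D_1."""
    pooled = list(y_trt1) + list(y_trt2)
    n, n1 = len(pooled), len(y_trt1)
    d_obs = abs(mean(y_trt1) - mean(y_trt2))
    total = sum(pooled)
    hits = count = 0
    for idx in combinations(range(n), n1):
        s1 = sum(pooled[j] for j in idx)
        d = s1 / n1 - (total - s1) / (n - n1)
        # Small tolerance so that values tied with d_obs are counted.
        hits += abs(d) >= d_obs - 1e-12
        count += 1
    return hits / count


# Hypothetical data: the three largest values all fall in treatment 1.
pval = permutation_pvalue([12, 15, 14], [9, 10, 11])
print(pval)  # 2 of the 20 splits are at least as extreme, so 0.1
```

With only 20 possible splits, the smallest attainable two-sided p-value is 0.1, which illustrates why the size of the test can fall well below the nominal level for small samples.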

Remark: At first glance, one might think that
exchangeability of the $N$ pairs $Y[1.i, 2.i]$ could replace
$(A_2, A_3)$. Unfortunately, a stronger exchangeability
assumption would be needed to guarantee the uniform
permutation distribution of (5). Specifically,
the assumption must lead to the exchangeability of
the $n$ components of $Y[t.s]$. Along these lines, we
could consider a more restrictive no-treatment-effect
hypothesis, for example, that all $2N$ components
in $Y[1.P, 2.P]$ are exchangeable. Then the permutation
test would be valid under $H_0$ = {$A_1$, this exchangeability
hypothesis}. It is useful to note that the exchangeability
hypothesis can be viewed as the equal-distributions
hypothesis above along with extra assumptions about the
process distribution. In this sense, this strong
exchangeability hypothesis is an example of the
no-treatment-effect hypotheses considered herein.

6.2.2 Wilcoxon rank sum test. Consider the
no-treatment-effect hypothesis

$$Y[1.i] \sim Y[2.i], \quad i = 1, \ldots, N.$$

Under $H_0$ = {$A_1$, $A_2$, $A_3$, $A_4$, this hypothesis}, where $A_4$ is
the assumption that the c.d.f.s $F_t$ are continuous, we can
state the null as $F_1 = F_2$ and base our test on

$$W = W(R) = \sum_{j=1}^{n} R_j\,1(t_j = 1),$$

where $W \mid (R \in \Pi(r))$ has a known, computable
distribution under $H_0$.

Here $R_j = \text{rank}(Y[t_j.s_j])$ and $r_j = \text{rank}(y[t_j.s_j])$,
where the ranks are taken over the $n$ values in $Y[t.s]$
and $y[t.s]$, respectively. Again, $\Pi(r)$ is the set of
permutations of $r$. That this distribution is known
under $H_0$ follows because in this case

$$R \mid (R \in \Pi(r)) \sim \text{uniform over points in } \Pi(r). \tag{6}$$

The computability follows because $W(x)$ can be
computed for any $x \in \Pi(r)$.

In practice, we would report a one- or two-sided
p-value. For convenience, let $W_{\text{obs}} = W(r)$ be the
observed rank sum statistic. Then the following is
a reasonable two-sided p-value (of course there are
others):

$$\text{pval}(W_{\text{obs}}) = 2\min\{P_{H_0}(W \ge W_{\text{obs}} \mid R \in \Pi(r)),\ P_{H_0}(W \le W_{\text{obs}} \mid R \in \Pi(r))\}.$$

Assuming that $A_1$-$A_4$ hold, an observed p-value < $\alpha$
would give us statistical evidence against
$F_1 = F_2$, that is, evidence at the $\alpha$ level that
$F_1 \ne F_2$.
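The rank sum p-value can be sketched in Python as follows. This is an illustration only: the function name and data are hypothetical, the statistic is taken to be the sum of the treatment 1 ranks, ties are rejected outright (consistent with the continuity assumption $A_4$), and the null distribution is obtained by enumerating all rank subsets, which is practical only for small $n$.

```python
from itertools import combinations


def rank_sum_pvalue(y_trt1, y_trt2):
    """Exact two-sided rank sum p-value, 2 * min(upper tail, lower tail)."""
    pooled = list(y_trt1) + list(y_trt2)
    if len(set(pooled)) != len(pooled):
        raise ValueError("ties present; this sketch assumes continuous data")
    n, n1 = len(pooled), len(y_trt1)
    # Rank the pooled values (rank 1 = smallest).
    order = sorted(range(n), key=lambda j: pooled[j])
    ranks = [0] * n
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    w_obs = sum(ranks[:n1])  # sum of the treatment 1 ranks
    # Under the null, every n1-subset of ranks is equally likely.
    null_w = [sum(c) for c in combinations(range(1, n + 1), n1)]
    upper = sum(w >= w_obs for w in null_w) / len(null_w)
    lower = sum(w <= w_obs for w in null_w) / len(null_w)
    return w_obs, 2 * min(upper, lower)


w_obs, pval = rank_sum_pvalue([12.1, 15.3, 14.2], [9.4, 10.8, 11.5])
print(w_obs, pval)  # ranks 4, 6, 5 give W = 15; p-value 2 * (1/20) = 0.1
```

Because the statistic depends on the data only through the ranks, the null distribution can be tabulated once for each pair of sample sizes.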


6.2.3 Two-sample t tests. Consider the
no-treatment-effect hypothesis

$$E(Y[1.i]) = E(Y[2.i]), \quad i = 1, \ldots, N.$$

Under $H_0$ = {$A_1$, $A_2$, $A_3$, $A_5$, this hypothesis}, where $A_5$
states that we are sampling from $N(\mu_t, \sigma_t^2)$ distributions,
we can state the null as $\mu_1 = \mu_2$ and base our test on

$$T = \frac{D_1}{SE(D_1)} \overset{H_0}{\sim} \text{approx } t(\nu),$$

where $\nu$ is Welch’s formula for the approximate degrees
of freedom and $t(\nu)$ is Student’s t distribution.
The standard error has the familiar form

$$SE(D_1) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},$$

where $s_t^2$ is the sample variance of the $\{y[t_j.s_j] : t_j = t\}$.
Because $D_1$ is simply the difference between the
two unweighted sample averages, this statistic $T$ is
identical to Welch’s (1938) version of the two-sample
t statistic.

An approximate two-sided p-value can be computed as

$$P_{H_0}(|T| > |T_{\text{obs}}|) \approx P(|t(\nu)| > |T_{\text{obs}}|) = \text{apval}(T_{\text{obs}}).$$

Assuming that $A_1$-$A_3$ and $A_5$ hold, an approximate
p-value < $\alpha$ gives statistical evidence against
$\mu_1 = \mu_2$, that is, evidence at the approximate $\alpha$ level
that $\mu_1 \ne \mu_2$.
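For concreteness, the Welch statistic above and the pooled variant that this subsection goes on to describe can be sketched in Python. The function names and data are hypothetical, and the sketch stops at the statistic and its degrees of freedom: converting them to p-values requires Student's t distribution function, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(y1, y2):
    """Welch statistic T = D_1 / SE(D_1) and Welch's approximate df."""
    n1, n2 = len(y1), len(y2)
    v1, v2 = variance(y1) / n1, variance(y2) / n2
    t_stat = (mean(y1) - mean(y2)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


def pooled_t(y1, y2):
    """Pooled statistic T_p = D_1 / SE_p(D_1) with df = n1 + n2 - 2."""
    n1, n2 = len(y1), len(y2)
    sp2 = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    t_stat = (mean(y1) - mean(y2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2


y1 = [12.0, 15.0, 14.0, 13.0]
y2 = [9.0, 10.0, 11.0, 10.0]
t_w, df_w = welch_t(y1, y2)
t_p, df_p = pooled_t(y1, y2)
print(t_w, df_w, t_p, df_p)
```

With equal sample sizes the two statistics coincide and only the reference degrees of freedom differ; with unequal sample sizes and unequal variances they generally differ.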

Under $H_0$ = {$A_1$, $A_2$, $A_3$, $A_6$, the equal-means hypothesis
above}, where $A_6$ states that we are sampling from
$N(\mu_t, \sigma^2)$ distributions, we can state the null as
$\mu_1 = \mu_2$ and base our test on

$$T_p = \frac{D_1}{SE_p(D_1)} \overset{H_0}{\sim} t(\nu), \quad \nu = n_1 + n_2 - 2.$$

The standard error has the familiar form

$$SE_p(D_1) = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},$$

where $s_p^2$ is the pooled estimate
$((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2)/(n_1 + n_2 - 2)$. The statistic $T_p$
is the standard two-sample pooled t statistic.

An exact two-sided p-value can be computed as

$$P_{H_0}(|T_p| > |T_{p,\text{obs}}|) = P(|t(\nu)| > |T_{p,\text{obs}}|) = \text{pval}(T_{p,\text{obs}}).$$

Assuming that $A_1$-$A_3$ and $A_6$ hold, an exact p-value
< $\alpha$ gives statistical evidence against $\mu_1 = \mu_2$,
that is, evidence at the approximate $\alpha$ level that
$\mu_1 \ne \mu_2$.

Under the less restrictive assumption, $H_0$ = {$A_1$,
$A_2$, $A_3$, $A_7$, the equal-means hypothesis}, where $A_7$
states that we are sampling from any distributions $F_t$
with mean $\mu_t$ and variance $\sigma_t^2$, we can still use $T$ to
test $\mu_1 = \mu_2$, but the actual size of the tests based on
the p-value, which uses the t approximation, may
be far from the nominal $\alpha$. In practice, the
approximation is usually reasonable when $n$ is large enough
to compensate for any asymmetry in the underlying
$F_t$ distributions.

6.3 Randomization-Based Tests

With the randomization-based approach, we condition
on both the process and the sample (only $T$
is random), and use

$$y[t.s] \leftarrow y[T.S] \mid (Y = y, S = s) \sim y[T.s] \mid (Y = y, S = s)$$

to carry out inferences about $y[1.s, 2.s]$. The
randomization-based test procedures outlined below
are valid provided the assumptions $B_1$ and $B_2$
of Section 5.2 hold. There it was pointed out that
these two assumptions can be made tenable when
the treatment assignment is carried out by mechanical
randomization.

6.3.1 Fisher randomization test. Although R. A.
Fisher never explicitly used potential variables, several
authors, including Welch (1937), Rubin (1990,
2005), and Cox (2009), have suggested that he tacitly
used the no-unit-specific-effects (or sharp null)
hypothesis in this randomized comparative experiment
setting. That is, it has been suggested that to
Fisher the no-treatment-effect hypothesis had the
form $y[1.s_j] = y[2.s_j]$, $j = 1, \ldots, n$, or, in simpler
notation,

$$y[1.s] = y[2.s].$$

When $H_0$ = {$B_1$, $B_2$, this sharp null hypothesis} holds,
we have by (4) that $E(D_3 \mid S = s) = 0$ and we can base
a test of $H_0$ on

$$D_3 \mid (S = s),$$

which has a known, computable distribution under $H_0$.

This null distribution is known because $D_3$ has
form $D_3 = D_3(T)$ and assumption $B_2$ tells us
that the distribution of $T \mid (S = s)$ is known. It is
computable because under $B_1$, the distribution of
$D_3 \mid (S = s)$ depends only on $y[1.s, 2.s]$ and under
the sharp null the observed data $y[t.s]$ determines the
collection $y[1.s, 2.s]$.

An exact two-sided p-value can be computed as

$$\text{pval}(D_{3,\text{obs}}) = P_{H_0}(|D_3| \ge |D_{3,\text{obs}}| \mid S = s) = P_{H_0}(T \in \{x : |D_3(x)| \ge |D_{3,\text{obs}}|\} \mid S = s).$$

If we assume that $B_1$ and $B_2$ hold, an exact p-value
< $\alpha$ gives statistical evidence against $y[1.s] = y[2.s]$,
that is, evidence at the $\alpha$ level that $y[1.s_j] \ne y[2.s_j]$
for at least one subject $s_j$. This test is called
a Fisher randomization test because it is based on
the randomization approach and it was described by
Fisher (1935).

This Fisher randomization test based on $D_3$ is tailored
to detect differences between $\bar{y}[1.s]$ and $\bar{y}[2.s]$.
To detect other differences, such as scale differences
between the $y[1.s]$ and $y[2.s]$, an alternative to $D_3$
should (and can easily) be used.

Attractive features of this Fisher randomization
test include the following: it has size guaranteed to
be no larger than $\alpha$; it is valid when the sampling
depends on the process ($S \not\perp\!\!\!\perp Y$); it does not require
a model for the process variables $Y$; and it does not
require an estimate of the variance, $\text{var}(D_3 \mid S = s)$.
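The Fisher randomization p-value can be sketched in Python for an arbitrary known randomization distribution. This is a minimal illustration, not the code used for the paper: the function names, the representation of the design as a list of (assignment, probability) pairs, and the toy data are all assumptions. The key step is that, under the sharp null, the observed values stand in for the outcomes under every possible assignment.

```python
from itertools import combinations


def d3(y, assignment, incl):
    """D_3 with weights w3[t.j] = n * P(unit j receives treatment t | S = s)."""
    n = len(y)
    total = 0.0
    for j, t in enumerate(assignment):
        sign = 1.0 if t == 1 else -1.0
        total += sign * y[j] / (n * incl[(t, j)])
    return total


def fisher_randomization_pvalue(y_obs, t_obs, design):
    """design: list of (assignment tuple, probability) giving T | (S = s)."""
    n = len(y_obs)
    # First-order inclusion probabilities implied by the design.
    incl = {(t, j): sum(p for a, p in design if a[j] == t)
            for t in (1, 2) for j in range(n)}
    d_obs = abs(d3(y_obs, t_obs, incl))
    # Sharp null: y_obs[j] is unit j's value whatever treatment it receives.
    return sum(p for a, p in design
               if abs(d3(y_obs, a, incl)) >= d_obs - 1e-12)


# Hypothetical example: complete randomization of 6 units, 3 per treatment.
n, n1 = 6, 3
assignments = [tuple(1 if j in treated else 2 for j in range(n))
               for treated in combinations(range(n), n1)]
design = [(a, 1 / len(assignments)) for a in assignments]

y_obs = [12.0, 15.0, 14.0, 9.0, 10.0, 11.0]
t_obs = (1, 1, 1, 2, 2, 2)
pval = fisher_randomization_pvalue(y_obs, t_obs, design)
print(pval)
```

Passing a non-uniform design changes both the inclusion probabilities and the probability attached to each assignment, which is where the randomization p-value can depart from the permutation p-value.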

Randomization vs. Permutation P-values: It is
clear that this Fisher randomization test is conceptually
very different from the process-based permutation
test. Indeed, as a rule, the randomization
p-value based on $D_3$ is numerically different from the
permutation p-value based on $D_1$. In fact, even if we
had based both p-values on the same statistic $D_1$,
the p-values would generally be different. There is
an exception to this rule. Consider the special case
uniform randomization distribution,

$$P(T = x \mid S = s) = \frac{1(x \in \Pi(t))}{\binom{n}{n_1}}, \tag{7}$$

where $n_1$ is the number of 1’s in $t$ and $\Pi(t)$ is the set
of all rearrangements of $n_1$ 1’s and $n_2 = n - n_1$ 2’s.
This is the randomization used in the special case
two-treatment completely randomized design (e.g.,
Cox, 1958b, pages 71-72; Kempthorne, 1977, Section 8).
In this case, $D_1$ and $D_3$ are numerically
identical, and the randomization and permutation
p-values are numerically identical. It is this identity
that often leads practitioners to incorrectly conclude
that the process-based permutation test is identical
to the randomization test. See Ernst (2004) for an
interesting discussion.
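The numerical identity of $D_1$ and $D_3$ in this special case can be verified directly from the weights of Section 6.1. The following short derivation is a sketch that assumes the uniform randomization distribution described above; it counts the assignments that give treatment 1 to a fixed sampled unit $s_j$.

$$P(T.s \ni 1.s_j \mid S = s) = \frac{\binom{n-1}{n_1-1}}{\binom{n}{n_1}} = \frac{n_1}{n} \quad\Longrightarrow\quad w_3[1.s_j] = n \cdot \frac{n_1}{n} = n_1 = w_1[1.s_j].$$

The same argument gives $w_3[2.s_j] = n_2 = w_1[2.s_j]$, so $D(y, s, t, w_3) = D(y, s, t, w_1)$ for every assignment $t$ with $n_1$ 1's.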

6.3.2 Neyman randomization test. Compared to
the view attributed to Fisher, Neyman was more
interested in detecting nonzero treatment effects of
the aggregate variety, especially $\bar{y}[1.s] - \bar{y}[2.s]$. He
apparently found it less practically useful to detect
unit-specific effects if the average effect was 0. For
this reason, Neyman used the no-average-effect
hypothesis (cf. Welch, 1937). That is, Neyman viewed
the no-treatment-effect hypothesis as

$$\bar{y}[1.s] = \bar{y}[2.s].$$

Because the sharp null hypothesis implies the
no-average-effect hypothesis, Neyman’s approach focused
on a narrower set of alternatives than Fisher,
thereby opening up the possibility of finding a test
with higher power than the Fisher randomization
test, at least for alternatives of practical (in Neyman’s
view) interest.

When $H_0$ = {$B_1$, $B_2$, this no-average-effect hypothesis}
holds, we have by (4) that $E(D_3 \mid S = s) = 0$ and we
can consider basing a test of $H_0$ on

$$D_3 \mid (S = s),$$

which has a known, but noncomputable,
distribution under $H_0$.

This null distribution is known because $D_3 = D_3(T)$
and $T \mid (S = s)$ has a known distribution. It, however,
is not computable because it depends on $y[1.s, 2.s]$,
which is not determined by the observed data $y[t.s]$
under the no-average-effect hypothesis. In
contrast, recall that under the more restrictive
unit-specific (or sharp) null, the observed data did
determine $y[1.s, 2.s]$.


Neyman was clearly aware of this noncomputability
issue and instead invoked a central limit theorem
and used

$$Z_3 = Z_3(T) = \frac{D_3}{SE(D_3 \mid S = s)}, \quad \text{where } Z_3 \mid (S = s) \overset{H_0}{\sim} \text{approx } N(0,1).$$

Here, SE is a standard error, which is an estimator
of the standard deviation, $\text{sd}(D_3 \mid S = s)$. The standard
deviation can be computed using sampling theory
as described in Särndal et al. (1992). However,
finding a reasonable estimator SE of this standard
deviation is more difficult because of the 0 second-order
inclusion probabilities. Toward this end, Neyman
(1923) derived a reasonable estimator of a tight
upper bound for the variance under simplifying
assumptions on the inclusion probabilities (Gadbury,
2001; Rubin, 1990; see Copas, 1973, for a related
result). It is useful to note that the variance attains
this upper bound when unit-treatment additivity
holds, that is, $y[1.s_j] = y[2.s_j] + \text{constant}$,
$j = 1, \ldots, n$. In this paper, we use the Neyman estimator
of variance. The square root of this estimator is
$SE(D_3 \mid S = s)$.


Remark. There is a related approximate normality
result when the no-average-effect hypothesis does not
hold. Under $(B_1, B_2)$, we noted in (4) that
$E(D_3 \mid S = s) = \bar{y}[1.s] - \bar{y}[2.s]$ and we have that
$\text{sd}(D_3 \mid S = s)$ is approximated by $SE(D_3 \mid S = s)$. By
the central limit theorem and continuous mapping
results, we have

$$\frac{D_3 - (\bar{y}[1.s] - \bar{y}[2.s])}{SE(D_3 \mid S = s)} \,\Big|\, (S = s) \sim \text{approx } N(0,1).$$

This result is useful for testing other hypotheses and
for computing confidence intervals.

The Normal approximation for $Z_3$ generally improves
as the number of support points in $T \mid (S = s)$
increases. However, when the differences
$y[1.s_j] - y[2.s_j]$ are highly variable, the unit variance in the
approximation can be a substantial overestimate
(see Gadbury, 2001), and when $y[1.s_j] - y[2.s_j] =$
constant, the unit variance can be a slight underestimate
when the sample sizes are small (based on
observations from the simulation study carried out
for this paper).

An approximate two-sided p-value can be computed as

$$P_{H_0}(|Z_3| > |Z_{3,\text{obs}}| \mid S = s) \approx P(|N(0,1)| > |Z_{3,\text{obs}}|) = \text{apval}(Z_{3,\text{obs}}).$$

If $B_1$ and $B_2$ hold, an approximate p-value < $\alpha$ gives
statistical evidence against $\bar{y}[1.s] = \bar{y}[2.s]$,
that is, evidence at the approximate $\alpha$ level that
$\bar{y}[1.s] \ne \bar{y}[2.s]$. This test is called a Neyman
randomization test because it is based on the randomization
approach and ideas in Neyman (1923).
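A Neyman-type randomization test can be sketched in Python as below. Two assumptions are made and should be read as such: the design is completely randomized (so $D_3$ is a difference of sample averages), and the variance estimator is the textbook conservative form $s_1^2/n_1 + s_2^2/n_2$. The exact estimator used for the paper's results is not reproduced in this section and may differ in detail, so the numbers below need not match the author's R code.

```python
from math import erfc, sqrt
from statistics import mean, variance


def neyman_randomization_test(y1_obs, y2_obs):
    """Return (Z_3, approximate two-sided p-value) under the stated assumptions."""
    n1, n2 = len(y1_obs), len(y2_obs)
    d3 = mean(y1_obs) - mean(y2_obs)  # D_3 under complete randomization
    # Assumed textbook Neyman-type standard error.
    se = sqrt(variance(y1_obs) / n1 + variance(y2_obs) / n2)
    z = d3 / se
    apval = erfc(abs(z) / sqrt(2))  # equals P(|N(0,1)| > |z|)
    return z, apval


# Hypothetical observed values for the two treatment groups.
z, apval = neyman_randomization_test([12.0, 15.0, 14.0, 13.0],
                                     [9.0, 10.0, 11.0, 10.0])
print(z, apval)
```

The only difference from a Welch-type analysis in this sketch is the reference distribution: the standard Normal replaces Student's t, which is one reason the approximation can be anti-conservative for small samples.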

Unlike the Fisher randomization test of the sharp
null hypothesis, the size of the Neyman test of the
no-average-effect hypothesis is not guaranteed to
be less than or equal to $\alpha$; it is only approximately
size $\alpha$. For smaller $n_1$ and $n_2$ and when the more
restrictive sharp null hypothesis holds, the Neyman
randomization test tends to be anti-conservative, with
size a bit larger than the nominal $\alpha$. This follows
because the Neyman estimator of the variance tends to
slightly underestimate the true variance in this case.
For moderate $n_1$ and $n_2$ the approximation is usually
reasonable provided $D_3(T)$ has enough support
points with respect to the $T \mid (S = s)$ distribution.
We empirically explore this approximation below.

6.4 Selection-Based Tests 

With the selection-based approach, we condition
on the process values [only $(S,T)$ is random] and
use

$$y[t.s] \leftarrow Y[T.S] \mid (Y = y) \sim y[T.S] \mid (Y = y)$$

to carry out inferences about $y[1.P, 2.P]$. The
selection-based test procedures outlined below are
valid provided the assumptions $C_1$ and $C_2$ of
Section 5.3 hold. There it was pointed out that these
two assumptions are often untenable, so the following
test procedures must be applied with caution.

6.4.1 Fisher selection test. The no-unit-specific-
treatment-effect (or sharp null) hypothesis in this
selection-based setting has the form $y[1.i] = y[2.i]$,
$i = 1, \ldots, N$, or, more simply,

$$y[1.P] = y[2.P].$$

When $H_0$ = {$C_1$, $C_2$, this sharp null hypothesis} holds,
we have by (4) that $E(D_{23}) = 0$ and we can consider
basing a test of $H_0$ on

$$D_{23}(S,T),$$

which has a known, but noncomputable,
distribution under $H_0$.

This null distribution is known because $D_{23}$ has
form $D_{23} = D_{23}(S,T)$ and assumption $C_2$ tells us
that the distribution of $(S,T)$ is known. It is,
however, not computable because it depends on
$y[1.P, 2.P]$, which is not determined by the observed
data $y[t.s]$ under the sharp null hypothesis. To see this,
note that for $s' \ne s$, there is an $s'_j$ such that both
$y[1.s'_j]$ and $y[2.s'_j]$ are unobserved and hence not
computable even under the sharp null.

It follows that an exact Fisher selection test is not
available in this selection-based setting. We could
condition on the sample and be content using the
Fisher randomization test to draw inferences about
$y[1.s, 2.s]$ rather than $y[1.P, 2.P]$. Alternatively, we
could use the approximate selection-based test
described in the next subsection.


6.4.2 Neyman selection test. In analogy to the
randomization setting, Neyman likely would consider
the no-average-effect hypothesis:

$$\bar{y}[1.P] = \bar{y}[2.P].$$

When $H_0$ = {$C_1$, $C_2$, this no-average-effect hypothesis}
holds, we have by (4) that $E(D_{23}) = 0$ and, analogous
to the randomization setting, we can base a test of
$H_0$ on

$$Z_{23} = Z_{23}(S,T) = \frac{D_{23}}{SE(D_{23})}, \quad \text{where } Z_{23} \overset{H_0}{\sim} \text{approx } N(0,1).$$

Just as with $\text{sd}(D_3 \mid S = s)$ in the randomization
approach, the standard deviation $\text{sd}(D_{23})$ can be
computed and estimated using sampling theory. The
estimation, however, is subject to the same problems
as in the randomization approach because of the
nonmeasurability of the probability sample $T.S$. Suffice
it to say that a reasonable Neyman estimator
$SE(D_{23})$, analogous to the one in the randomization
setting, exists.

The approximate Normality result follows just as
in the randomization setting. Specifically, under $C_1$
and $C_2$, and using the same arguments as in the
randomization approach, we have that quite generally

$$\frac{D_{23} - (\bar{y}[1.P] - \bar{y}[2.P])}{SE(D_{23})} \sim \text{approx } N(0,1).$$

The approximation generally improves as the number
of support points in $T.S$ increases. However,
when the differences $y[1.i] - y[2.i]$ are highly variable,
the unit variance in the approximation can be
a substantial overestimate (see Gadbury, 2001).

An approximate two-sided p-value can be computed as

$$P_{H_0}(|Z_{23}| > |Z_{23,\text{obs}}|) \approx P(|N(0,1)| > |Z_{23,\text{obs}}|) = \text{apval}(Z_{23,\text{obs}}).$$

If $C_1$ and $C_2$ hold, an approximate p-value < $\alpha$ gives
statistical evidence against $\bar{y}[1.P] = \bar{y}[2.P]$,
that is, evidence at the approximate $\alpha$ level that
$\bar{y}[1.P] \ne \bar{y}[2.P]$. This test is called a Neyman
selection test because it is based on the selection
approach and ideas in Neyman (1923).

Just as in the randomization setting, the size of
the Neyman test of the no-average-effect hypothesis is
not guaranteed to be less than or equal to $\alpha$; it is only
approximately size $\alpha$. Remarks regarding the
approximation in this selection setting are analogous to
those given at the end of Section 6.3.2, in the
randomization setting.

7. EMPIRICAL INVESTIGATIONS 
7.1 Cell Phone Use Example (Revisited) 

The process variable $Y[t.i]$ is defined as the reaction
time for the $i$th unit in population $P$ when exposed
to treatment $t$. Inference about the process $Y$
distribution will be difficult to describe because the
sample of 64 students was not taken from any
well-defined population $P$. For any substantively
interesting population, for example, $P$ = licensed drivers
in Utah, the assumption that $S \perp\!\!\!\perp Y$ is untenable
given the haphazard nature of the sample selection.
The untenability of $S \perp\!\!\!\perp Y$ also implies that it will
be difficult to carry out inferences about the population
values $y[1.P, 2.P]$ for any substantively interesting
population $P$. For these reasons, it makes
sense to focus on inferences about the 128 potential
values in $y[1.s, 2.s]$. That is, it is arguably better to
use randomization-based inference for this example.

We assume that the randomization was carried
out mechanically so that $T \perp\!\!\!\perp Y \mid S$ and we assume
that the distribution of $T \mid (S = s)$ is uniform in the
sense of (7); that is, conditions $B_1$ and $B_2$ of
Section 6.3 are assumed to hold. We will use the Fisher
randomization test to test the no-treatment-effect
hypothesis $y[1.s_j] = y[2.s_j]$, $j = 1, \ldots, 64$, and
the Neyman randomization test to test the
no-treatment-effect hypothesis $\bar{y}[1.s] = \bar{y}[2.s]$.

For these data, the observed randomization statistics
are

$D_{3,\text{obs}} = 51.59$, $Z_{3,\text{obs}} = \ldots$,
pval($D_{3,\text{obs}}$) = 0.0074 and
apval($Z_{3,\text{obs}}$) = 0.0075.

Because the Fisher randomization p-value
pval($D_{3,\text{obs}}$) = 0.0074 is small, we have sufficient
evidence to reject the sharp null hypothesis; there is
statistical evidence that $y[1.s_j] \ne y[2.s_j]$ for at least
one subject in the sample of 64. Because the Neyman
randomization p-value apval($Z_{3,\text{obs}}$) = 0.0075 is small,
we have sufficient evidence to reject the
no-average-effect hypothesis; there is statistical evidence
that $\bar{y}[1.s] \ne \bar{y}[2.s]$. In fact, because
$D_{3,\text{obs}} = 51.59$ is a Horvitz-Thompson unbiased
estimate of $\bar{y}[1.s] - \bar{y}[2.s]$, the Neyman test gives
statistical evidence that the reaction time values are
higher on average when cell phones are used, at
least for this sample of 64. In other words, there is
statistical evidence of a treatment effect.

For completeness and for comparison purposes, we
also give the values of the other commonly used
p-values, viz., permutation, Wilcoxon, Welch’s
approximate t, and the pooled t:

pval($D_{1,\text{obs}}$) = 0.0074, pval($W_{\text{obs}}$) = 0.0184,
apval($T_{\text{obs}}$) = 0.0110 and pval($T_{p,\text{obs}}$) = 0.0107.

Strictly speaking, these are only applicable for
process-based inference, so they are of questionable
utility for this example. As noted above, because the
randomization distribution is uniform, the permutation
p-value pval($D_{1,\text{obs}}$) is numerically (but not
conceptually!) identical to the Fisher randomization
p-value pval($D_{3,\text{obs}}$).

All the computations were carried out in R. 
The author has written code to compute the Ney¬ 
man randomization p-value. The Fisher randomiza¬ 
tion and permutation p-values were approximated 
using Monte-Carlo estimation (here we used 10® 
simulations) as carried out in twot.permutation
{DAAG}. The Wilcoxon p-value was computed using
wilcox.test {stats}. Note that when there are
ties, as there are in this example, wilcox.test only
reports approximate p-values.

7.2 A Simulation Study 

This section empirically compares the operating
characteristics of the different tests considered in
this paper, under a variety of scenarios. All computations
were carried out in R, with p-values computed
as described at the end of the previous subsection.
The simulated data are generated according
to models of the form

$$y[1.i] \leftarrow Y[1.i] \ \text{IID} \sim [\text{scenario}],$$

$$y[2.i] \leftarrow Y[2.i] \sim [\text{scenario}], \quad i = 1, \ldots, N,$$

$$s \leftarrow S \mid (Y = y) \sim P(S = (1, \ldots, n) \mid Y = y) = 1, \quad \text{where } n = N, \tag{8}$$

$$t \leftarrow T \mid (Y = y, S = s) \sim P(T = t' \mid Y = y, S = s) = \frac{1(t' \in \mathcal{T})}{\binom{n}{n_1}},$$

where $n = n_1 + n_2$ and $\mathcal{T}$ is the set of all possible
rearrangements of $n_1$ 1’s and $n_2$ 2’s. Looking back at
the process-based assumptions of Section 5.1, we see
that $A_1$ holds, but none of $A_2$-$A_7$ is guaranteed to
hold. Both the randomization-based assumptions $B_1$
and $B_2$ of Section 5.2 hold, as do both the selection-
based assumptions $C_1$ and $C_2$ of Section 5.3. A more
extensive simulation would also investigate scenarios
where more of the assumptions do not hold.

For data-generation models of the form (8), we
have that (i) the randomization- and selection-based
approaches are identical because the sample $S$ is
taken to be equal to the population $P$ with probability
one; and (ii) the permutation and Fisher randomization
p-values are numerically (not conceptually!)
identical because the randomization distribution
(the distribution of $T$) is uniform over the set
of all possible treatment assignments.

Although the permutation-, Wilcoxon-, and t- 
tests are process-based approaches, we will esti¬ 
mate their operating characteristics for both the 
process and randomization (here, randomization = 
selection) distributions. Similarly, the Fisher and 
Neyman randomization tests are randomization- 
based approaches, but we report their operating 
characteristics for both the process and the randomization
distributions. In the tables below, the
rows labeled “Randomization” give Monte-Carlo
estimates of the power of the tests over the distribution
$T \mid (Y = y, S = s)$. The rows labeled “Process”
give Monte-Carlo estimates of the power of the tests
over the distribution $Y \mid (S = s, T = t)$. In all cases,
the nominal size is set at $\alpha = 0.05$.
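To make the “Randomization” rows concrete, the following Python sketch computes randomization sizes for a scaled-down example. It is an illustration under stated assumptions, not the author's simulation code: it uses $n_1 = n_2 = 4$ rather than 10 so that all 70 assignments can be enumerated exactly instead of sampled, it fixes hypothetical potential values satisfying the sharp null, and it uses the textbook Neyman-type variance estimator $s_1^2/n_1 + s_2^2/n_2$.

```python
import random
from itertools import combinations
from math import erfc, sqrt
from statistics import variance


def exact_randomization_sizes(y, n1, alpha=0.05):
    """Rejection rates of the Fisher and Neyman tests over all assignments.

    y holds fixed values with y[1.s_j] = y[2.s_j] (sharp null), so the
    observed data under any assignment are just these values, relabeled.
    """
    n = len(y)
    splits = list(combinations(range(n), n1))
    total = sum(y)
    diffs = []
    for idx in splits:
        s1 = sum(y[j] for j in idx)
        diffs.append(s1 / n1 - (total - s1) / (n - n1))
    fisher_rej = neyman_rej = 0
    for idx, d in zip(splits, diffs):
        # Fisher: exact two-sided randomization p-value.
        p_fisher = sum(abs(x) >= abs(d) - 1e-12 for x in diffs) / len(diffs)
        # Neyman: Normal approximation with the assumed variance estimator.
        g1 = [y[j] for j in idx]
        g2 = [y[j] for j in range(n) if j not in idx]
        se = sqrt(variance(g1) / n1 + variance(g2) / (n - n1))
        p_neyman = erfc(abs(d) / se / sqrt(2)) if se > 0 else 1.0
        fisher_rej += p_fisher < alpha
        neyman_rej += p_neyman < alpha
    return fisher_rej / len(splits), neyman_rej / len(splits)


rng = random.Random(0)
y_fixed = [rng.gauss(10, 2) for _ in range(8)]  # hypothetical fixed values
fisher_size, neyman_size = exact_randomization_sizes(y_fixed, n1=4)
print(fisher_size, neyman_size)
```

The Fisher rate can never exceed the nominal level, whereas the Neyman rate carries no such guarantee; how far it departs from 0.05 depends on the fixed values and the sample sizes.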

The simulation results in Tables 3-6 give us a
glimpse at the operating characteristics of the tests
for a variety of scenarios, labeled “Sc.” The following
summary focuses on comparisons between
the Fisher and Neyman randomization tests, but the
table entries afford broader comparisons.

For small $n_1, n_2$, when $y[1.s_j] - y[2.s_j] =$ constant,
the Neyman randomization test tends to be just a
bit anti-conservative for testing the no-average-effect
hypothesis; that is, the actual size appears to be a
little larger than the nominal size (see scenarios 1, 2,
and 6 of Table 3). This anti-conservativeness
presumably stems from the fact that the Neyman
estimator of the variance, $\text{var}(D_3 \mid S = s)$, tends to be
slightly biased on the low side when
$y[1.s_j] - y[2.s_j] =$ constant. For larger $n_1, n_2$, this
anti-conservativeness disappears (scenarios 1, 2, and 6
of Table 5).

When the differences $y[1.s_j] - y[2.s_j]$ are highly
variable, the Neyman randomization test tends to be
a bit conservative for testing the no-average-effect
hypothesis, although not as conservative as the Fisher
randomization test (scenarios 4 and 7 in Tables 3
and 5). This conservativeness presumably stems from
the fact that the Neyman estimator of the variance,
$\text{var}(D_3 \mid S = s)$, tends to be biased on the high side
when $y[1.s_j] - y[2.s_j]$ are highly variable (see
Gadbury, 2001).

For small $n_1, n_2$, the Normal approximation to
the Neyman test statistic can be unreasonable when
there are extreme outliers present (scenario 3 of
Table 3). With larger $n_1, n_2$, the Normal approximations
become more reasonable in the presence of extreme
outliers (scenario 3 of Table 5).

In all of the simulation scenarios, the Neyman
randomization test had higher power than the Fisher
randomization test (see Tables 4 and 6), especially
when $n_1, n_2$ are smaller (see Table 4). Of course,
power comparisons are most useful when both tests
have the same size. Because neither of these tests has
size exactly equal to the nominal 0.05, these power
comparisons should be considered carefully. In
particular, in head-to-head comparisons, the Fisher test
is at a disadvantage because its actual size is
guaranteed to be no larger than 0.05; the Neyman test
has size that is only approximately equal to, and can
exceed, the nominal 0.05.

On the basis of this limited simulation study, we
recommend that practitioners at least think seriously
about using the Neyman randomization test
as an alternative to the Fisher randomization test,
especially when $n_1, n_2$ are moderate, say, at least 10,
and when there are no extreme outliers.

8. DISCUSSION 

This paper used concepts from the rich litera¬ 
tures on causal analysis and finite-population sam¬ 
pling theory to clear up some of the confusion 
that exists about tests of the no-treatment-effect 
hypothesis in the randomized comparative experi¬ 
ment setting. Our approach lends itself to explicit 
specifications of the candidate no-treatment-effects 
hypotheses and targets of inference. We clearly 
distinguished between three main inference ap¬ 
proaches: process-based, randomization-based, and 
selection-based. The commonly-used permutation 
Table 3

Monte-Carlo estimates of size when $n_1 = n_2 = 10$, nominal size = 5%

| $n_1 = n_2 = 10$ | Permutation^a | Wilcoxon | t (Welch) | t (Pooled) | Fisher^a | Neyman |
|---|---|---|---|---|---|---|
| Sc. 1 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~ $N(10, 2^2)$; $y[2.i] \leftarrow Y[2.i] = Y[1.i]$, $i = 1, \ldots, 20$ | | | | | | |
| Randomization | 4.6 | 3.6 | 4.7 | 4.7 | 4.6 | 6.5 |
| Process | 4.3 | 3.4 | 4.2 | 4.3 | 4.3 | 6.9 |
| Sc. 2 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~ Gamma(shape = 1, scale = 5); $y[2.i] \leftarrow Y[2.i] = Y[1.i]$, $i = 1, \ldots, 20$ | | | | | | |
| Randomization | 5.0 | 4.9 | 4.1 | 4.6 | 5.0 | 7.4 |
| Process | 4.0 | 4.1 | 3.2 | 3.5 | 4.0 | 7.7 |
| Sc. 3 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~ $0.9\,U(0,20) + 0.1\,U(200,201)$, “mixture of uniforms”; $y[2.i] \leftarrow Y[2.i] = Y[1.i]$, $i = 1, \ldots, 20$ | | | | | | |
| Randomization^b | 4.6 | 3.9 | 0.0 | 0.0 | 4.6 | 0.0 |
| Process | 3.8 | 3.5 | 1.1 | 1.8 | 3.8 | 11.2 |
| Sc. 4 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~ $N(10, 2^2)$; $y[2.i] \leftarrow Y[2.i] = Y[1.i] + E_i - \bar{E}$, $E_i$ IID ~ $N(0, 3^2)$, $i = 1, \ldots, 20$ | | | | | | |
| Randomization | 1.5 | 1.9 | 1.7 | 1.8 | 1.5 | 3.3 |
| Process | 2.7 | 2.0 | 2.5 | 2.6 | 2.7 | 4.2 |
| Sc. 5 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~ Gamma(shape = 1, scale = 5); $y[2.i] \leftarrow Y[2.i] = 2Y[1.i] - \bar{y}[1.P]$, $i = 1, \ldots, 20$ | | | | | | |
| Randomization | 4.8 | 6.8 | 4.3 | 4.4 | 4.8 | 7.6 |
| Process | 4.0 | 7.4 | 3.6 | 3.7 | 4.0 | 7.4 |
| Sc. 6 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~ bin(1, 0.28); $y[2.i] \leftarrow Y[2.i] = Y[1.i]$, $i = 1, \ldots, 20$ | | | | | | |
| Randomization^c | 0.0 | NA | 9.1 | 9.1 | 0.0 | 9.1 |
| Process | 2.1 | NA | 4.5 | 4.5 | 2.1 | 11.3 |
| Sc. 7 (listed $H_0$ true): $y[1.i] \leftarrow Y[1.i]$ IID ~^d bin(1, 0.28); $y[2.i] \leftarrow Y[2.i]$ IID ~^d bin(1, 0.28), corr($Y[1.i]$, $Y[2.i]$) = 0.37, $i = 1, \ldots, 20$ | | | | | | |
| Randomization^e | 0.4 | NA^f | 1.6 | 1.6 | 0.4 | 4.6 |
| Process | 0.2 | NA | 1.0 | 1.0 | 0.2 | 3.7 |

Table entries give the rejection rates (as a percent) for the 1000 simulations.

All indented hypotheses are also true; see Section 4.2. For example, in row 1, the listed hypothesis is true. It follows that all the other hypotheses in Section 4.2 are also true.

^a For this simulation, the permutation and Fisher randomization test results are numerically identical.

^b The fixed $y$ includes one large observation from the $U(200,201)$ distribution.

^c The fixed $y[1.P]$ = 00000100000000011010 = $y[2.P]$.

^d This is an approximation because the $Y$ values are adjusted to satisfy the listed hypothesis.

^e The fixed $y[1.P]$ = 10010101001001000000, $y[2.P]$ = 00001101001001100000.

^f Because of the many ties in the binomial case, the Wilcoxon test as described herein is not applicable.


test, Wilcoxon rank sum test, and two-sample t tests
are examples of process-based approaches. Examples
of randomization-based approaches include the
commonly used Fisher randomization test and the
less commonly used Neyman randomization test.
We also described a Neyman selection test. A small-scale
empirical comparison of these different tests
was carried out. On the basis of the simulation results,
we recommend that practitioners consider using
the Neyman randomization test in certain scenarios.
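
To make the structure of such an empirical comparison concrete, the following is a minimal Python sketch of how a Monte-Carlo rejection rate might be estimated for one true-null scenario (Y[1.i] IID ~ N(10, 2^2), Y[2.i] = Y[1.i], n1 = n2 = 10). It is not the code used for the tables. The function names, the Monte Carlo permutation p-value (in place of an exact one), and the reading of "randomization" rows as holding y fixed while "process" rows redraw Y in every simulation are assumptions made for illustration.

```python
import random
import statistics


def perm_pvalue(y1, y2, n_perm, rng):
    """Monte Carlo two-sided permutation p-value for |difference in means|."""
    observed = abs(statistics.fmean(y1) - statistics.fmean(y2))
    pooled = list(y1) + list(y2)
    n1 = len(y1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:]))
        if stat >= observed - 1e-12:
            hits += 1
    # The add-one correction counts the observed arrangement itself.
    return (hits + 1) / (n_perm + 1)


def rejection_rate(mode, n_sims=1000, n=10, n_perm=199, alpha=0.05, seed=1):
    """Estimate the rejection rate under a true no-effect scenario.

    mode == "randomization": y is drawn once and held fixed (assumed reading).
    mode == "process":       Y is redrawn in every simulation (assumed reading).
    """
    rng = random.Random(seed)

    def draw():
        y1 = [rng.gauss(10, 2) for _ in range(2 * n)]
        return y1, list(y1)  # no treatment effect: Y[2.i] = Y[1.i]

    y1_pop, y2_pop = draw()
    rejections = 0
    for _ in range(n_sims):
        if mode == "process":
            y1_pop, y2_pop = draw()
        units = list(range(2 * n))
        rng.shuffle(units)  # uniform randomization of units to treatments
        group1 = [y1_pop[i] for i in units[:n]]
        group2 = [y2_pop[i] for i in units[n:]]
        if perm_pvalue(group1, group2, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for mode in ("randomization", "process"):
        rate = rejection_rate(mode, n_sims=200, n_perm=99)
        print(mode, round(100 * rate, 1))
```

With only a few hundred simulations the estimates are noisy; a rate based on 1000 simulations near a true size of 5% has a Monte Carlo standard error of roughly 0.7 percentage points.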

In our description of the process-based approach,
we focused on testing hypotheses about the distribution
of Y_. More generally, the process-based approach
can be used both to estimate, or test hypotheses
about, characteristics of the distribution
Table 4

Monte-Carlo estimates of power when n1 = n2 = 10, nominal size = 5%

n1 = n2 = 10      Permutation^a  Wilcoxon  t (Welch)  t (Pooled)  Fisher^a  Neyman

Sc. 1 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = Y[1.i] + 2, i = 1,...,20
  Randomization   52.7           49.3      51.3       52.5        52.7      59.9
  Process         55.9           51.6      55.5       56.1        55.9      62.7

Sc. 2 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = Y[1.i] + 2 + E_i - Ebar, E_i IID ~ N(0, 3^2), i = 1,...,20
  Randomization   26.2           23.6      24.1       25.7        26.2      35.6
  Process         28.6           23.8      26.1       27.2        28.6      36.6

Sc. 3 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = 1.2 Y[1.i], i = 1,...,20
  Randomization   34.7           27.1      34.8       35.3        34.7      43.0
  Process         48.4           43.0      47.5       48.4        48.4      57.2

Sc. 4 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ Gamma(shape = 1, scale = 5)
  y[2.i] <- Y[2.i] = 2 Y[1.i], i = 1,...,20
  Randomization   19.2           12.9      16.1       18.7        19.2      28.6
  Process         30.2           23.7      23.4       26.0        30.2      38.5

Sc. 5 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ Gamma(shape = 1, scale = 5)
  y[2.i] <- Y[2.i] = 3 Y[1.i] + E_i, E_i IID ~ N(0, 5^2), i = 1,...,20
  Randomization   45.7           28.6      40.2       45.3        45.7      65.5
  Process         49.2           38.1      39.8       44.4        49.2      63.5

Sc. 6 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ bin(1, 0.28)
  y[2.i] <- Y[2.i] IID ~ bin(1, 0.71), corr(Y[1.i], Y[2.i]) = 0.29, i = 1,...,20
  Randomization^b 18.9           NA        37.3       37.3        18.9      37.4
  Process         29.6           NA        48.0       48.0        29.6      50.3

Table entries give the rejection rates (as a percent) for the 1000 simulations.
^a For this simulation, the permutation and Fisher randomization test results are numerically identical.
^b The fixed y[1.P] = 00000100101000011010, y[2.P] = 01111110101100111011.


of Y_ and predict/estimate the unobserved values
y[-t.s]. Here, y[-t.s] is the collection of all 2N components
of y excluding those with subscripts in the
set t.s. A look back at the assumptions Ai-Ay shows
that we did not have to specify a model for the joint
distribution of Y_ to carry out a test of no treatment
effect. We only assumed independence across units
and modeled the marginal distributions of Y[1.i] and
Y[2.i]. In contrast, the prediction of unobserved values
generally requires a model for the joint distribution
of Y_; equivalently, a model for (Y[t.s], Y[-t.s]),
the “(Y_obs, Y_mis)” of Rubin (e.g., 2005). Rubin advocates
using a Bayesian approach to process-based
prediction of y[-t.s].

This paper restricted attention to inferences about
one population or sample, under two scenarios corresponding
to two treatments. Owing to randomization,
we were able to compare these two treatment
scenarios; for example, see equation (4). Comparing
two populations of distinct units is a qualitatively
different inference problem. However, similar
notation and model structures can be used to study
this problem as well. Interestingly, in this two-population
setting, Fisher randomization tests, as described
herein, are generally not applicable. In contrast,
the other tests described in this paper, including
the Neyman selection test, are applicable.

The notation and model structure introduced in
this paper can be directly applied in more general
settings where nonuniform or constrained randomization
is used or where there are more than two
treatments being compared; see, for example, the
descriptions in Sutter et al. (1963), Kempthorne
(1977), and Bailey (1981). There are extensions in
other directions. For example, rather than testing
Table 5

Monte-Carlo estimates of size when n1 = n2 = 50, nominal size = 5%

n1 = n2 = 50      Permutation^a  Wilcoxon  t (Welch)  t (Pooled)  Fisher^a  Neyman

Sc. 1 (true)  H0^UP
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = Y[1.i], i = 1,...,100
  Randomization   4.0            4.0       4.0        4.0         4.0       4.4
  Process         4.7            4.8       4.8        4.8         4.7       5.5

Sc. 2 (true)  H0^UP
  y[1.i] <- Y[1.i] IID ~ Gamma(shape = 1, scale = 5)
  y[2.i] <- Y[2.i] = Y[1.i], i = 1,...,100
  Randomization   4.9            5.0       4.8        4.8         4.9       5.4
  Process         4.1            3.9       3.9        3.9         4.1       4.8

Sc. 3 (true)  H0^UP
  y[1.i] <- Y[1.i] IID ~ 0.9 U(0,20) + 0.1 U(200,201), "mixture of uniforms"
  y[2.i] <- Y[2.i] = Y[1.i], i = 1,...,100
  Randomization^b 4.2            6.5       4.3        4.5         4.2       8.6
  Process         5.3            5.4       5.3        5.3         5.3       6.6

Sc. 4 (true)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = Y[1.i] + E_i - Ebar, E_i IID ~ N(0, 3^2), i = 1,...,100
  Randomization   2.5            3.1       2.4        2.4         2.5       3.4
  Process         3.0            4.6       2.9        3.2         3.0       3.9

Sc. 5 (true)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ Gamma(shape = 1, scale = 5)
  y[2.i] <- Y[2.i] = 2Y[1.i] - ybar[1.P], i = 1,...,100
  Randomization   4.6            42.5      4.4        4.4         4.6       6.1
  Process         2.8            35.1      2.8        2.8         2.8       5.0

Sc. 6 (true)  H0^UP
  y[1.i] <- Y[1.i] IID ~ bin(1, 0.28)
  y[2.i] <- Y[2.i] = Y[1.i], i = 1,...,100
  Randomization^c 2.2            NA        5.5        5.5         2.2       5.5
  Process         3.5            NA        5.0        5.0         3.5       5.9

Sc. 7 (true)  H0^DUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~^d bin(1, 0.28)
  y[2.i] <- Y[2.i] IID ~^d bin(1, 0.28), corr(Y[1.i], Y[2.i]) = 0.37, i = 1,...,100
  Randomization^e 0.5            NA        0.8        0.8         0.5       1.0
  Process         1.6            NA        2.8        2.8         1.6       3.5

Table entries give the rejection rates (as a percent) for the 1000 simulations.
^a For this simulation, the permutation and Fisher randomization test results are numerically identical.
^b The fixed y includes 7 large observations from the U(200,201) distribution.
^c The fixed y[1.P] = y[2.P] with ybar[1.P] = ybar[2.P] = 32/100.
^d This is an approximation because the Y_ values are adjusted to satisfy
^e The fixed y is such that y[1.P] ≠ y[2.P], ybar[1.P] = ybar[2.P] = 33/100, and corr(y[1.P], y[2.P]) = 0.186.


hypotheses, the ideas introduced in this paper show 
promise for confidence interval estimation. More 
work in this direction will be forthcoming. 

In the binary response, comparative experiment
setting, Fisher’s exact test for 2 x 2 tables (see
Agresti, 2002, page 91) is equivalent to the Fisher
randomization test of the no-treatment-effect
hypothesis when T ⊥⊥ Y_ | S and T | (S = s) has a
uniform distribution as in (7); recall that this
hypothesis states that the binary response values
satisfy y[1.s_j] = y[2.s_j], j = 1,...,n. Fisher’s exact
test is also numerically equivalent to the process-based
permutation test of the corresponding no-effect
hypothesis when (S, T) ⊥⊥ Y_ and
Y[t.i] indep ~ bin(1, π_t); here the hypothesis is
equivalent to π_1 = π_2. In fact, in the simulation (scenarios
6 and 7 of Tables 3 and 5, and scenario 6 of Tables 4
and 6), because of the uniform randomization distribution,
we were able to use the R code for Fisher’s
exact test, fisher.test {stats}, to compute the
exact values of the Fisher randomization and permutation
p-values. On a related note, we point out
Table 6

Monte-Carlo estimates of power when n1 = n2 = 50, nominal size = 5%

n1 = n2 = 50      Permutation^a  Wilcoxon  t (Welch)  t (Pooled)  Fisher^a  Neyman

Sc. 1 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = Y[1.i] + 1, i = 1,...,100
  Randomization   80.9           76.4      80.4       80.4        80.9      81.3
  Process         69.5           67.7      69.9       69.9        69.5      70.4

Sc. 2 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = Y[1.i] + 1 + E_i - Ebar, E_i IID ~ N(0, 3^2), i = 1,...,100
  Randomization   36.3           31.4      36.2       36.4        36.3      42.7
  Process         37.9           36.3      37.5       38.0        37.9      42.8

Sc. 3 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ N(10, 2^2)
  y[2.i] <- Y[2.i] = 1.1 Y[1.i], i = 1,...,100
  Randomization   70.5           68.6      71.0       71.1        70.5      72.1
  Process         66.6           63.9      65.7       65.7        66.6      67.4

Sc. 4 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ Gamma(shape = 1, scale = 5)
  y[2.i] <- Y[2.i] = 1.5 Y[1.i], i = 1,...,100
  Randomization   46.6           39.0      46.2       46.4        46.6      49.5
  Process         49.2           40.6      48.0       48.2        49.2      51.8

Sc. 5 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ Gamma(shape = 1, scale = 5)
  y[2.i] <- Y[2.i] = 1.7 Y[1.i] + E_i, E_i IID ~ N(0, .5^2), i = 1,...,100
  Randomization   41.0           35.2      40.4       40.5        41.0      44.3
  Process         39.2           30.7      38.9       39.0        39.2      44.2

Sc. 6 (false)  H0^EUP, H0^RAs
  y[1.i] <- Y[1.i] IID ~ bin(1, 0.28)
  y[2.i] <- Y[2.i] IID ~ bin(1, 0.50), corr(Y[1.i], Y[2.i]) = 0.36, i = 1,...,100
  Randomization^b 48.8           NA        58.8       58.8        48.8      60.3
  Process         51.1           NA        60.1       60.1        51.1      60.3

Table entries give the rejection rates (as a percent) for the 1000 simulations.
^a For this simulation, the permutation and Fisher randomization test results are numerically identical.
^b The fixed y is such that y[1.P] ≠ y[2.P], ybar[1.P] = 24/100, ybar[2.P] = 45/100, and corr(y[1.P], y[2.P]) = 0.386.


that the Neyman randomization test is also available
for testing the no-treatment-effect hypothesis
y[1.s] = y[2.s] in 2 x 2 tables. This paper’s simulation
results suggest that when the randomization
distribution is uniform as in (7), this Neyman randomization
test for 2 x 2 tables may be somewhat
more powerful than Fisher’s exact test.
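
The numerical agreement between Fisher's exact test and the Fisher randomization test for binary responses can be checked by brute force on a small example. The Python sketch below is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the test statistic is taken to be the absolute difference in sample proportions, the two groups have equal size, and the two-sided Fisher p-value sums the hypergeometric probabilities that do not exceed the probability of the observed table.

```python
from itertools import combinations
from math import comb


def fisher_exact_two_sided(x, n1, K, N):
    """Two-sided Fisher exact p-value for a 2 x 2 table.

    x  = number of successes in group 1, n1 = size of group 1,
    K  = total number of successes,      N  = total number of units.
    Sums hypergeometric probabilities no larger than the observed one.
    """
    def pmf(k):
        return comb(K, k) * comb(N - K, n1 - k) / comb(N, n1)

    lo, hi = max(0, n1 - (N - K)), min(n1, K)
    p_obs = pmf(x)
    return sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-7))


def randomization_pvalue(y, treated):
    """Exact randomization p-value with y held fixed.

    Under the no-treatment-effect hypothesis y[1.i] = y[2.i], the observed
    responses are unchanged by reassignment, so every equally likely
    assignment of n1 units to treatment 1 can be enumerated.
    """
    N, n1 = len(y), len(treated)
    total = sum(y)

    def stat(group):
        s1 = sum(y[i] for i in group)
        return abs(s1 / n1 - (total - s1) / (N - n1))

    observed = stat(treated)
    count = sum(1 for g in combinations(range(N), n1)
                if stat(g) >= observed - 1e-12)
    return count / comb(N, n1)


if __name__ == "__main__":
    y = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # fixed binary responses
    treated = (0, 1, 2, 3, 4, 5)              # units assigned to treatment 1
    x = sum(y[i] for i in treated)
    print(randomization_pvalue(y, treated))
    print(fisher_exact_two_sided(x, len(treated), sum(y), len(y)))
```

With equal group sizes the hypergeometric distribution is symmetric about K/2, so ordering tables by probability and ordering them by the absolute difference in proportions give the same rejection region; with unequal group sizes the two orderings need not coincide, and the choice of statistic then matters.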